RANDOM  EFFECTS  MODELS  FOR  NOMINAL 
AND  ORDINAL  DATA 


By 

JONATHAN  SETH  HARTZEL 


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 
OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 
DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 


1999 


To  my  family,  Tracy,  Riley,  and  Kendi. 


ACKNOWLEDGEMENTS 


I would  like  to  express  my  sincere  gratitude  to  Dr.  Alan  Agresti  for  serving  as 
my  dissertation  advisor  and  for  providing  me  the  opportunity  to  work  with  him  as 
a research  assistant.  During  the  past  two  years  as  his  research  assistant,  I have 
gained  invaluable  experience  in  both  conducting  statistical  research  and  producing 
scholarly  writing.  His  obvious  interest  in  both  my  present  research  and  my  future  as 
a professional  has  been  a constant  encouragement  to  me.  I would  also  like  to  thank 
Drs.  Malay  Ghosh,  James  Hobert,  Ramon  Littell,  and  Gary  Miller  for  serving  on  my 
committee.  In  addition,  I thank  Dr.  Jane  Pendergast  for  her  time  on  my  committee 
before  leaving  the  University  of  Florida. 

I thank  the  Lord  for  giving  me  the  strength  and  the  perseverance  to  reach  this 
point,  and  for  giving  my  wife,  Tracy,  the  patience  and  understanding  through  it  all. 
Her  countless  sacrifices  for  me  and  unwavering  confidence  in  me  were  the  cornerstone 
for  my  success.  I am  forever  grateful. 

I thank my parents for their love, their continual support of my educational pursuits,
and their constant prayers. In addition, I thank my brother and sister for the
many  cards  and  pictures  of  the  kids,  and  for  telling  them  all  about  Uncle  Jonathan 
and  Aunt  Tracy  while  we  have  been  away.  I also  want  to  thank  my  mother-  and 
father-in-law  for  their  immediate  love  and  support  from  the  first  time  I met  them. 
Their  visits  to  Gainesville  were  much  appreciated.  Finally,  I thank  Riley  and  Kendi 
for  always  being  excited  to  see  me  when  I came  home  from  school. 


TABLE  OF  CONTENTS 


ACKNOWLEDGEMENTS

ABSTRACT

CHAPTERS

1 METHODS FOR MODELING LONGITUDINAL AND CLUSTERED DATA
    1.1 Introduction
    1.2 Normal Response Data
    1.3 Non-Normal Response Data
    1.4 Outline

2 MULTIVARIATE GENERALIZED LINEAR MODELS
    2.1 Introduction
    2.2 Definition
    2.3 Maximum Likelihood Estimation
    2.4 Applications

3 MULTIVARIATE GENERALIZED LINEAR MIXED MODELS FOR NOMINAL AND ORDINAL RESPONSE DATA
    3.1 Introduction
    3.2 Multivariate Generalized Linear Mixed Models
    3.3 Maximum Likelihood Estimation
    3.4 Inference and Prediction
    3.5 Pseudo-Likelihood Estimation
    3.6 Applications
    3.7 Cumulative Logit Models with Random Thresholds

4 NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION IN MULTIVARIATE GENERALIZED LINEAR MIXED MODELS
    4.1 Introduction
    4.2 Nonparametric Maximum Likelihood Estimation
    4.3 Identifiability
    4.4 Application
    4.5 Simulation Studies

5 METHODS FOR ANALYZING ORDINAL MULTI-CENTER CLINICAL TRIAL DATA
    5.1 Introduction
    5.2 Fixed Effects Approach
    5.3 Random Effects Approach
    5.4 Simulation Study
    5.5 Score Tests for a Common Association Parameter
    5.6 Application

6 CONCLUSIONS
    6.1 Summary of Results
    6.2 Future Research

REFERENCES

BIOGRAPHICAL SKETCH


Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 


RANDOM  EFFECTS  MODELS  FOR  NOMINAL 
AND  ORDINAL  DATA 


By 

Jonathan  Seth  Hartzel 
December  1999 

Chairman:  Alan  G.  Agresti 
Major  Department:  Statistics 

Models  for  nominal  and  ordinal  response  data  are  important  in  many  areas  of 
research.  In  medical  studies,  patients  are  often  evaluated  on  an  ordinal  or  graded 
scale.  Nominal  data,  such  as  types  of  services  used  at  a hospital,  are  frequent  in  the 
field  of  health  care.  It  is  often  the  case  that  such  data  are  nested  within  clusters 
or  repeatedly  assessed  over  time.  In  this  dissertation  we  propose  random  effects 
models  for  analyzing  longitudinal  or  clustered  nominal  and  ordinal  response  data. 
Specifically,  we  present  a general  multinomial  logit  random  effects  model  that  we 
motivate  within  the  framework  of  a multivariate  generalized  linear  mixed  model. 
As  special  cases  of  the  proposed  model,  we  consider  models  based  on  the  cumulative 
logit,  adjacent-category  logit,  and  continuation-ratio  logit  link  functions  for  analyzing 
ordinal  response  data,  and  the  baseline-category  logit  link  function  for  nominal  data. 

For the proposed multinomial random effects models, we consider both parametric
and nonparametric assumptions for the distribution of the random effects. In
the parametric approach, we assume that the random effects follow a multivariate
normal distribution. We consider direct maximization of the marginal likelihood using
adaptive Gauss-Hermite quadrature, as well as indirect maximization using an
automated Monte Carlo expectation-maximization (EM) algorithm. In addition, we
propose a pseudo-likelihood approach for obtaining approximate maximum likelihood
estimates. In the nonparametric approach, we assume that the random effects follow
an unspecified discrete distribution. We propose an EM algorithm for obtaining
nonparametric maximum likelihood estimates of the model parameters and the discrete
distribution. Using simulation, we compare the performance of the parametric and
nonparametric approaches when the random effects distribution is misspecified.

We  also  examine  the  use  of  the  proposed  models  for  modeling  ordinal  multi-center 
clinical  trial  data.  We  consider  random  effects  models  that  allow  for  a common 
association  across  all  centers,  as  well  as  heterogeneous  associations.  We  also  propose 
Laplace  and  adaptive  Gauss-Hermite  quadrature  approximated  score  tests  for  testing 
that  a common  association  parameter  holds  in  the  heterogeneous  random  effects 
model.  We  show  that  the  latter  test  performs  poorly  for  small  to  moderate  numbers 
of  centers. 




CHAPTER  1 

METHODS  FOR  MODELING  LONGITUDINAL  AND  CLUSTERED  DATA 

1.1  Introduction 

In many areas of research, data are collected on the same experimental unit over
time or under multiple conditions. Longitudinal studies, in which measurements are
taken on the same subject over time, occur frequently in medical and biological
research. Repeated measures data are common in the social and behavioral sciences,
where each treatment or condition is applied to each subject in an effort to control
between-subject variability. Data may also be collected in groups or clusters. Surveys
and observational studies of populations that have a natural hierarchical structure,
such as students nested within classrooms, lead to the collection of clustered data.
In multi-center clinical trials, multiple treatments are compared at different sites,
resulting in data that are clustered at the center level. Regardless of the sampling
mechanism, observations taken on the same subject or from the same cluster are
often correlated. Using traditional linear or generalized linear models for modeling
and inference in this setting would be incorrect, because the assumption of independent
observations is violated. Special modeling techniques are needed that can account for
the correlation within subjects and clusters.

One  technique  for  incorporating  correlation  that  has  become  increasingly  popular 
involves  the  use  of  random  effects.  Models  that  incorporate  random  effects  appear 
throughout  the  literature  under  a variety  of  names  such  as  random  effects  models 
(Laird  and  Ware  1982;  Stiratelli  et  al.  1984),  mixed  effects  models  (Harville  and  Mee 
1984),  variance  component  models  (Harville  1977),  and  random  coefficient  models 
(Longford 1993). The basic idea underlying a random effects model is that heterogeneity
exists across subjects in all or some subset of their regression coefficients. This
variability may be attributable to, for example, unmeasured covariates or imperfect
measurement of measured covariates. It is assumed that the heterogeneity can be
represented by a probability distribution. To account for the variability, an unobserved
random variable from the probability distribution is incorporated additively in the
model. Since observations from the same subject share the same unobserved realization,
correlation is induced between the observations within the subject. Random
effects  models  are  defined  conditionally  upon  the  random  effects.  Estimates  of  the 
fixed  and  random  parameters  are  obtained  by  maximizing  the  marginal  likelihood, 
which  requires  integrating  the  joint  likelihood  over  the  random  effects. 
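
The way a shared random effect induces within-subject correlation can be checked by simulation. The following is a minimal sketch, assuming the simplest case of a normal random intercept model, $y_{ij} = \beta_0 + u_i + e_{ij}$, with two observations per subject. The function names, variance values, sample size, and seed are illustrative assumptions, not part of any method reviewed here. Under this model the correlation between two observations on the same subject is $\sigma_u^2 / (\sigma_u^2 + \sigma_e^2)$.

```python
import math
import random


def simulate_pairs(n_subjects, sigma_u, sigma_e, beta0=0.0, seed=1):
    """Draw two responses per subject that share one random intercept u_i."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, sigma_u)               # subject-level random effect
        y1 = beta0 + u + rng.gauss(0.0, sigma_e)  # first observation
        y2 = beta0 + u + rng.gauss(0.0, sigma_e)  # second observation, same u
        pairs.append((y1, y2))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation between the two coordinates."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)


sigma_u, sigma_e = 1.0, 1.0
empirical = pearson(simulate_pairs(20000, sigma_u, sigma_e))
theoretical = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)
print(round(empirical, 3), theoretical)
```

Setting `sigma_u` to a value near zero in this sketch drives the empirical correlation toward zero, which corresponds to the independence assumption of an ordinary linear model.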

Statistical methods for correlated data, and software packages for implementing
these methods, are readily available when the data consist of correlated normal
responses. In this case, when the random effects distribution is also assumed to be
normal, estimation of both the fixed and random effects is straightforward because the
marginal likelihood can be written in closed form. A detailed history of the linear
mixed  model  (LMM)  can  be  found  in  Searle  et  al.  (1992).  In  the  past  twenty  years 
there  has  been  a considerable  amount  of  research  in  the  area  of  correlated  data  where 
the  response  is  non-normal.  In  particular  much  attention  has  focused  on  correlated 
Poisson,  Bernoulli,  and  binomial  response  data.  A survey  article  by  Pendergast  et  al. 
(1996)  listed  well  over  one  hundred  references  for  the  analysis  of  correlated  binary 
data  alone!  Random  effects  models  in  this  context  are  often  derived  as  extensions 
of  generalized  linear  models  (GLMs)  (Nelder  and  Wedderburn  1972;  McCullagh  and 
Nelder  1989)  and  referred  to  as  generalized  linear  mixed  models  (GLMMs)  (Gilmour 
et  al.  1985).  In  contrast  to  linear  mixed  models,  the  assumption  of  a normal  random 
effects  distribution  leads  to  an  intractable  marginal  likelihood.  Thus  a majority  of  the 
recent  literature  in  this  area  has  focused  on  methods  for  approximating  the  marginal 
likelihood  (Zeger  and  Karim  1991;  McCulloch  1997;  Booth  and  Hobert  1999). 




In  comparison,  there  has  been  relatively  little  research  for  nominal  and  ordinal 
response  data.  A nominal  response  is  a categorical  variable  with  unordered  levels, 
whereas  an  ordinal  response  has  ordered  levels.  Models  for  nominal  and  ordinal 
response  variables  assume  that  the  counts  within  each  category  of  the  response  follow 
a multinomial  distribution  for  each  combination  of  the  covariates. 

The  majority  of  the  work  for  multinomial  response  data  has  focused  on  random 
effects  models  for  ordinal  responses,  with  the  random  effects  assumed  to  be  normal. 
As  in  the  binary  case,  this  leads  to  an  intractable  marginal  likelihood.  Harville  and 
Mee (1984) were among the first to propose a random effects model for ordinal
data,  fitting  the  model  using  a best  linear  unbiased  prediction  procedure  (Henderson 
1975).  They  utilized  a first-order  Taylor  series  approximation  for  evaluation  of  the 
intractable  integrals.  Simpler  models  that  allowed  for  only  single  random  effects 
and  utilized  numerical  integration  were  proposed  later  by  Jansen  (1990)  and  Ezzet 
and  Whitehead  (1991).  In  recent  work  by  Hedeker  and  Gibbons  (1994)  and  Tutz 
and  Hennevogl  (1996),  general  random  effects  regression  models  for  ordinal  responses 
were  proposed  along  with  a variety  of  estimation  procedures.  All  of  these  previous 
models  have  used  either  the  cumulative  logit  or  cumulative  probit  links.  In  contrast, 
Ten  Have  and  Uttal  (1994)  and  Ten  Have  (1996)  proposed  ordinal  random  effects 
models  based  on  the  continuation-ratio  logit  link  and  the  cumulative  complementary 
log-log  link,  respectively.  In  the  latter  model,  the  random  effects  distribution  was 
assumed  to  be  log-gamma,  which  resulted  in  a closed  form  marginal  likelihood. 

For nominal response data, Fahrmeir and Tutz (1994, p. 231) proposed a random
effects model based on the baseline-category logit model, as did Hedeker (2000).
Greater attention has been given to a special case of this model in the psychometric
literature, however. In psychometric research, a popular class of qualitative response
models is the Rasch family of models (Rasch 1961). Such models can be considered
as baseline-category logit models. Adams and Wilson (1996) considered a baseline-
category  logit  Rasch  model  that  allowed  for  shifted  thresholds.  This  work  was  then 
extended  by  Adams  et  al.  (1997)  to  allow  for  varying  thresholds.  We  will  consider 
these  models  in  greater  detail  in  Chapter  3. 

The  focus  of  this  dissertation  will  be  on  random  effects  models  for  nominal  and 
ordinal  data.  The  models  considered  will  be  motivated  as  extensions  to  multivariate 
generalized  linear  models  (MGLMs).  Careful  attention  will  be  given  to  estimation 
methods  when  the  random  effects  are  assumed  to  be  normally  distributed.  We  also 
propose  a model  in  which  this  assumption  is  relaxed,  resulting  in  a nonparametric 
random  effects  model.  Applications  of  these  models  to  multi-center  clinical  trial  data 
will  be  examined  in  detail.  We  begin  by  reviewing  in  greater  detail  some  of  the  work 
that  has  been  discussed  above.  In  Section  1.2  we  will  briefly  consider  models  for 
normal  responses.  Greater  attention  will  be  given  to  the  non-normal  response  case  in 
Section 1.3. There we will distinguish between subject-specific models and population-
averaged models. In the final section of this chapter an outline of the remainder of
the  dissertation  will  be  given. 

1.2  Normal  Response  Data 

As noted before, there has been an extensive amount of research on LMMs for
the analysis of longitudinal data (see, e.g., Jones 1993; Lindsey 1993; Diggle et al.
1994). We review one such model, originally proposed by Harville (1977) and further
developed by Laird and Ware (1982), that has been influential in the development
of models for non-normal responses. Let $\mathbf{y}_i$, $i = 1, \ldots, n$, be the response vector for
the $i$th subject. Harville (1977) proposed the linear mixed model

$$\mathbf{y}_i = \mathbf{Z}_i\boldsymbol{\beta} + \mathbf{W}_i\mathbf{u}_i + \boldsymbol{\epsilon}_i, \tag{1.1}$$

where $\boldsymbol{\beta}$ is a vector of fixed effect parameters, $\mathbf{u}_i$ is a vector of random effects, and $\mathbf{Z}_i$
and $\mathbf{W}_i$ are the corresponding design matrices. In addition, the vectors $\boldsymbol{\epsilon}_i$, $i = 1, \ldots, n$,
are independent and distributed as $N(\mathbf{0}, \mathbf{R}_i)$. The separation of the fixed and random
components in model (1.1) is now the standard for expressing LMMs and GLMMs.
The model formulation is conditional on $\mathbf{u}_i$, and the definition is completed by assuming
that the $\mathbf{u}_i$ are distributed as $N(\mathbf{0}, \mathbf{Q})$, independently of each other and of the $\boldsymbol{\epsilon}_i$.
Thus marginally

$$\mathbf{y}_i = \mathbf{Z}_i\boldsymbol{\beta} + \boldsymbol{\epsilon}_i^*, \tag{1.2}$$

where $\boldsymbol{\epsilon}_i^* \sim N(\mathbf{0}, \mathbf{V}_i)$ and $\mathbf{V}_i = \mathbf{R}_i + \mathbf{W}_i\mathbf{Q}\mathbf{W}_i'$. The covariance matrices, $\mathbf{R}_i$ and $\mathbf{Q}$, are
typically functions of unknown parameters, and allow for modeling of both within-subject
and between-subject associations.

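Let $\boldsymbol{\theta}$ be the vector of parameters for the covariance matrix $\mathbf{V}_i$. If $\boldsymbol{\theta}$ is known,
then the maximum likelihood estimate (MLE) of $\boldsymbol{\beta}$, calculated from the model (1.2),
is

$$\hat{\boldsymbol{\beta}} = \left(\sum_{i=1}^{n} \mathbf{Z}_i'\mathbf{V}_i^{-1}\mathbf{Z}_i\right)^{-1} \sum_{i=1}^{n} \mathbf{Z}_i'\mathbf{V}_i^{-1}\mathbf{y}_i \tag{1.3}$$

and is equal to the weighted-least-squares solution. Harville (1977), as well as others,
showed that the MLE of $\boldsymbol{\beta}$ is also a best linear unbiased estimator (BLUE). Due
to the assumptions of normality and the independence of $\mathbf{u}_i$ and $\boldsymbol{\epsilon}_i$, the predicted
posterior means (modes) of $\mathbf{u}_i$ are easily shown to be

$$\hat{\mathbf{u}}_i = \mathbf{Q}\mathbf{W}_i'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{Z}_i\hat{\boldsymbol{\beta}}). \tag{1.4}$$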
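
As a worked special case of (1.3) and (1.4), consider a random intercept model in which both design matrices reduce to a vector of ones, the random effect variance is a scalar, and the residual covariance is a multiple of the identity. The sketch below assumes the variance components are known, and its function name and data are hypothetical. In this case each $\mathbf{V}_i$ has a closed-form inverse, so no matrix library is needed, and (1.4) reduces to shrinking each subject's mean deviation toward zero.

```python
def random_intercept_fit(groups, sigma2_u, sigma2_e):
    """Evaluate (1.3) and (1.4) when Z_i = W_i = vector of ones.

    groups   : list of lists, the responses y_i for each subject
    sigma2_u : scalar random-intercept variance (plays the role of Q), assumed known
    sigma2_e : residual variance, so that R_i = sigma2_e * I, assumed known
    """
    num = 0.0  # accumulates sum_i Z_i' V_i^{-1} y_i
    den = 0.0  # accumulates sum_i Z_i' V_i^{-1} Z_i
    for y in groups:
        t = len(y)
        # For V_i = sigma2_e * I + sigma2_u * 1 1', V_i^{-1} 1 = 1 / (sigma2_e + t * sigma2_u) * 1
        scale = 1.0 / (sigma2_e + t * sigma2_u)
        num += scale * sum(y)
        den += scale * t
    beta_hat = num / den

    u_hat = []
    for y in groups:
        t = len(y)
        scale = 1.0 / (sigma2_e + t * sigma2_u)
        # (1.4): Q W_i' V_i^{-1} (y_i - Z_i beta_hat)
        u_hat.append(sigma2_u * scale * (sum(y) - t * beta_hat))
    return beta_hat, u_hat


beta_hat, u_hat = random_intercept_fit([[1.0, 2.0, 3.0], [4.0, 6.0]], 1.0, 1.0)
print(beta_hat, u_hat)
```

With these illustrative data the two subject means are 2 and 5, the estimate of the intercept is 58/17, and the predicted random effects are -18/17 and 18/17, each smaller in magnitude than the raw deviation of the subject mean from the estimated intercept.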


In practice the components of $\boldsymbol{\theta}$ are not known and must be estimated. Harville
(1977) considered both maximum likelihood (ML) and restricted maximum likelihood
(REML) estimation of $\boldsymbol{\theta}$. The ML estimates of $\boldsymbol{\theta}$ are calculated by maximizing the marginal
log-likelihood, based on the marginal model (1.2), with respect to $\boldsymbol{\beta}$ and $\boldsymbol{\theta}$. The REML
estimates can be derived using two unrelated approaches (Laird and Ware 1982). One
method of obtaining the REML estimates is by maximizing the likelihood of $\boldsymbol{\theta}$ based
on a linearly transformed set of data $\mathbf{y}^* = \mathbf{A}\mathbf{y}$ such that the distribution of $\mathbf{y}^*$ does
not depend on $\boldsymbol{\beta}$. The transformed data $\mathbf{y}^*$ can be obtained by choosing $\mathbf{A}$ to be the
matrix that converts $\mathbf{y}$ to the ordinary least-squares residuals. The REML estimates
can also be derived from a Bayesian perspective. This is achieved by considering $\boldsymbol{\beta}$
in model (1.1) as a random variable having a vague or totally flat prior distribution
such that the prior density of $\boldsymbol{\beta}$ is a constant. For example, let

$$\boldsymbol{\beta} \sim N(\mathbf{0}, \boldsymbol{\Gamma}), \quad \text{with } \boldsymbol{\Gamma} \to \infty,$$

where the elements of $\boldsymbol{\Gamma}$ go to infinity. The REML estimates are found by maximizing
the limiting (as $\boldsymbol{\Gamma}^{-1} \to \mathbf{0}$) marginal log-likelihood of $\boldsymbol{\theta}$ given $\mathbf{y}$. Iterative methods are
needed to calculate either the ML or REML estimates of $\boldsymbol{\theta}$ since the likelihoods are
nonlinear functions of $\boldsymbol{\theta}$. Harville (1977) proposed Newton-Raphson and scoring
algorithms to estimate $\boldsymbol{\beta}$, $(\mathbf{u}_1, \ldots, \mathbf{u}_n)$, and $\boldsymbol{\theta}$. A number of expectation-maximization
(EM) algorithms (Dempster et al. 1977) for jointly estimating $(\boldsymbol{\beta}, \mathbf{u}_1, \ldots, \mathbf{u}_n, \boldsymbol{\theta})$ have
been proposed as well (Laird and Ware 1982; Jones 1993).
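
The difference between ML and REML can be seen in the simplest possible setting, which is assumed here purely for illustration: independent normal observations with a common mean and variance and no random effects. Basing the likelihood on the least-squares residuals, whose distribution does not depend on the mean, leads to dividing the residual sum of squares by $n - 1$ rather than by $n$. The function name and data below are hypothetical.

```python
def ml_and_reml_variance(y):
    """Return the ML and REML variance estimates for an i.i.d. normal sample."""
    n = len(y)
    mean = sum(y) / n                       # least-squares estimate of the mean
    rss = sum((v - mean) ** 2 for v in y)   # residual sum of squares
    sigma2_ml = rss / n                     # ML: ignores the estimated mean
    sigma2_reml = rss / (n - 1)             # REML: one degree of freedom spent on the mean
    return sigma2_ml, sigma2_reml


sigma2_ml, sigma2_reml = ml_and_reml_variance([1.0, 2.0, 3.0, 4.0])
print(sigma2_ml, sigma2_reml)
```

In this case the REML estimate coincides with the usual unbiased sample variance, while the ML estimate is smaller by the factor $(n - 1)/n$.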

1.3  Non-Normal  Response  Data 

We now review models for longitudinal and clustered data when the response is
non-normal. The majority of these models can be viewed as extensions to generalized
linear models (Nelder and Wedderburn 1972; McCullagh and Nelder 1989), and
so we begin by defining GLMs. Let $y_i$, $i = 1, \ldots, n$, be the response for the $i$th
subject and $\mathbf{z}_i$ the corresponding design vector. The definition of a GLM consists of
a distributional assumption and a structural assumption.

1.  Distributional Assumption:

Conditional on $\mathbf{z}_i$, the $y_i$ are independent and have a distribution in the
exponential family with $E(y_i \mid \mathbf{z}_i) = \mu_i$.

2.  Structural Assumption:

The expectation $\mu_i$ is related to the linear predictor $\eta_i = \mathbf{z}_i'\boldsymbol{\beta}$ by the link function
$g(\cdot)$ or, equivalently, by the response function $h(\cdot) = g^{-1}(\cdot)$ such that

$$\eta_i = g(\mu_i) \quad \text{or} \quad \mu_i = h(\eta_i).$$

A GLM  is  fully  defined  by  the  distribution  in  the  exponential  family,  the  form  of  the 
linear  predictor,  and  the  response  or  link  function.  Some  of  the  distributions  in  the 
exponential  family  include  the  normal,  Poisson,  Bernoulli,  and  binomial.  Thus,  GLMs 
provide  a unified  model  for  both  continuous  and  discrete  responses.  In  Chapter  2 we 
will  show  how  the  multinomial  distribution  can  be  embedded  within  the  framework 
of  a multivariate  GLM. 
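
The relationship between the link function $g$ and the response function $h$ can be made concrete with a short sketch. The pairings of link and distribution below are common textbook choices, assumed here for illustration; the dictionary, function names, and numerical values are hypothetical.

```python
import math


def logit(mu):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))


def expit(eta):
    """Inverse logit (response function for the logit link)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Each entry is (link g, response function h = g^{-1}).
LINKS = {
    "identity (normal)": (lambda mu: mu, lambda eta: eta),
    "log (Poisson)": (math.log, math.exp),
    "logit (Bernoulli/binomial)": (logit, expit),
}


def fitted_mean(z, beta, response):
    """Compute mu = h(z' beta) for one design vector z."""
    eta = sum(z_k * b_k for z_k, b_k in zip(z, beta))
    return response(eta)


z = [1.0, 2.0]        # intercept and one covariate
beta = [-1.0, 0.5]    # linear predictor eta = -1 + 0.5 * 2 = 0
for name, (g, h) in LINKS.items():
    mu = fitted_mean(z, beta, h)
    print(name, mu, g(mu))   # g(mu) recovers the linear predictor
```

The same linear predictor gives a different mean under each link, which is why the link must be specified alongside the distribution to define the model fully.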

In the linear mixed model (1.1) and the marginal linear mixed model (1.2) the
expectations of the response $\mathbf{y}$ are the same when the random effects are at their
mean of zero. That is,

$$E(\mathbf{y} \mid \mathbf{u} = \mathbf{0}) = E(\mathbf{y}) = \mathbf{Z}\boldsymbol{\beta}.$$

A simple example of this occurs in paired experiments, where the mean difference
across all subjects is the same as the difference between the two overall means. This
desirable property unfortunately does not hold for non-normal response data. Thus,
in the analysis of longitudinal data for non-normal responses, a distinction is made
between subject-specific (SS) models and population-averaged (PA) models (Zeger
et al. 1988; Neuhaus et al. 1991; Agresti 1993b). In SS models the heterogeneity
across subjects is modeled explicitly. Random or mixed effects models are examples
of such models. The parameters in SS models describe the influence
of covariates upon individuals. In PA models the population-averaged response is
modeled without explicitly accounting for the heterogeneity. The parameters in PA
models are interpreted as the averaged population response to changes in the covariates.
The relationship between the parameter estimates in the SS model and the PA
model has been studied by Zeger et al. (1988) and Neuhaus et al. (1991) for the logistic
and probit links and by Ten Have et al. (1996) for the cumulative logit link. For instance,
in the logistic mixed model with a normal random intercept, the PA parameters $\boldsymbol{\beta}^*$
and the SS parameters $\boldsymbol{\beta}$, both of dimension $p$, satisfy $|\beta_k^*| \leq |\beta_k|$, $k = 1, \ldots, p$. In
contrast, the parameters in a log-linear model with a normal random intercept will
have the same values for both the PA and SS approaches, except for the intercept
term (Diggle et al. 1994, p. 142).
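
The attenuation of population-averaged coefficients relative to subject-specific ones in the logistic model with a normal random intercept can be illustrated numerically. The sketch below assumes a single binary covariate and hypothetical parameter values; it obtains the marginal probabilities by averaging the conditional probabilities over the random intercept with a simple trapezoid rule, then compares the implied marginal log odds ratio with the subject-specific slope.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_prob(eta, sigma, n_grid=2001, width=8.0):
    """Average expit(eta + u) over u ~ N(0, sigma^2) using the trapezoid rule."""
    lo = -width * sigma
    h = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * h
        dens = math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        end_weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += end_weight * expit(eta + u) * dens
    return total * h


# Hypothetical subject-specific model: logit P(y = 1 | u) = beta0 + beta1 * x + u
beta0, beta1, sigma = -1.0, 1.0, 2.0
p0 = marginal_prob(beta0, sigma)            # marginal P(y = 1) when x = 0
p1 = marginal_prob(beta0 + beta1, sigma)    # marginal P(y = 1) when x = 1
beta1_pa = logit(p1) - logit(p0)            # implied population-averaged slope
print(beta1, beta1_pa)
```

Increasing `sigma` in this sketch widens the gap between the two slopes, while a value of `sigma` close to zero makes them nearly equal.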

To demonstrate the differences in inference between SS and PA modeling, consider
the following example. Let $y_{it}$ be the response at time $t$ for subject $i$, where $y_{it} = 1$ if
the subject has high blood pressure and $y_{it} = 0$ otherwise. Let $x_{it}$ be a corresponding
dummy covariate denoting whether or not subject $i$ was exercising at time $t$. From the
PA model one would make inference about the difference in the rate of high blood
pressure between exercisers and non-exercisers. The SS model would estimate the expected change in
a subject’s probability of having high blood pressure if they changed their exercise
habits. Thus the choice of PA versus SS modeling depends
on the study at hand and the desired inference. For population studies, such as
those found in epidemiology, the PA approach typically provides the most informative
inference. In contrast, growth curve studies, where interest lies in a subject’s response
profile over time, lend themselves to the SS approach. In the next two sections we
examine some of the modeling approaches that yield SS and PA inferences.

1.3.1  Subject-Specific  Models 

We first review GLMMs for Poisson, binomial, and Bernoulli data. As in the
linear mixed model (1.1), GLMs are extended to GLMMs by incorporating random
effects linearly within the linear predictor

$$\eta_{ij} = \mathbf{z}_{ij}'\boldsymbol{\beta} + \mathbf{w}_{ij}'\mathbf{u}_i. \tag{1.5}$$

Here the subscript $j$ denotes the $j$th observation, $j = 1, \ldots, T_i$, for the $i$th subject,
$\mathbf{z}_{ij}$ and $\mathbf{w}_{ij}$ are design vectors, and $\mathbf{u}_i$ is a vector of random effects. The GLMM is
defined by first assuming that, conditional on the random effects $\mathbf{u}_i$, $y_{ij}$ satisfies the
definition of a GLM with the linear predictor given in (1.5) and conditional mean
$E(y_{ij} \mid \mathbf{u}_i) = \mu_{ij}$. The definition is completed by assuming the random effects $\mathbf{u}_i$ are
independent with distribution $G(\mathbf{u}_i)$, and that the observations are conditionally
independent within and between subjects.

If  one  is  interested  in  making  inference  on  both  the  fixed  and  random  parameters, 
a full maximum likelihood approach to estimation should be used. In contrast, one
may be interested only in comparisons within subjects. In that case a conditional likelihood
approach could be used, where the random effects are treated as nuisance parameters
and conditioned out of the likelihood (see, e.g., Conaway 1989).

Maximum  likelihood  estimation 

In the full maximum likelihood approach the distribution of the random effects
is incorporated into the likelihood. Though in theory the distribution, $G(\mathbf{u}_i)$, of the
random effects may be any distribution, the common assumption is that $G(\mathbf{u}_i)$ is
multivariate normal with mean $\mathbf{0}$ and covariance matrix $\mathbf{Q}$. Estimation of $\boldsymbol{\beta}$ and $\boldsymbol{\theta}$,
the elements of $\mathbf{Q}$, entails maximization of the marginal likelihood obtained
by integrating with respect to the random effects $\mathbf{u}_i$. Let $f(y_{ij} \mid \mathbf{u}_i)$ be the conditional
distribution of $y_{ij}$ given the random effect $\mathbf{u}_i$. For the GLMM (1.5), the marginal
log-likelihood can be written

$$l(\boldsymbol{\beta}, \boldsymbol{\theta}; \mathbf{y}) = \sum_{i=1}^{n} \log \int \prod_{j=1}^{T_i} f(y_{ij} \mid \mathbf{u}_i; \boldsymbol{\beta}) \, dG(\mathbf{u}_i; \boldsymbol{\theta}). \tag{1.6}$$

Except for normal response data as in Section 1.2, the marginal likelihood will not
have a closed form due to the intractable multivariate normal integrals.

The logistic-normal model (Pierce and Sands 1975) was one of the first random
effects models for binary data in which the random term was incorporated on the
same scale as the fixed effects. Both Pierce and Sands (1975) and Williams (1982)
considered such a model that allowed for a random intercept (i.e., $w_{ij} = 1$ and
$\mathbf{u}_i = u_i$ in model (1.5)). Williams (1982) proposed an iterative scheme for estimating
the regression parameters and the variance of the random effect, which was based on a
quasi-likelihood but only fit well when the variability was small. In what seems to be
the  first  use  of  the  term  “generalized  linear  mixed  model”  in  the  literature,  Gilmour 
et  al.  (1985)  proposed  an  approximate  method  for  fitting  the  general  model  (1.5)  for 
the  probit  link.  They  proceeded  by  maximizing  the  likelihood  with  respect  to  the 
fixed  effects,  taking  expectations  over  the  random  effects.  The  resulting  estimating 
equations  are  analogs  of  the  REML  estimates  of  Harville  (1977).  Anderson  and 
Aitkin  (1985)  suggested  the  use  of  numerical  integration  within  the  context  of  an  EM 
algorithm  for  numerically  approximating  the  marginal  likelihood.  They  recommended 
the  use  of  Gauss-Hermite  quadrature,  which  will  be  discussed  in  detail  in  Chapter  3 
along  with  an  adaptive  version  of  Gauss-Hermite  quadrature  (Liu  and  Pierce  1994; 
Pinheiro  and  Bates  1995).  Hinde  (1982)  used  a similar  approach  in  the  context 
of  random  Poisson  regression  models.  The  development  of  methods  like  this  for 
calculating  the  exact  maximum  likelihood  estimates  by  approximating  the  integrals 
using  numerical  or  simulation  techniques  has  since  become  an  active  area  of  research 
for  GLMMs. 
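
To indicate how Gauss-Hermite quadrature approximates the marginal likelihood of a random intercept model, the following sketch evaluates the integral for each cluster in a logistic-normal model with a fixed five-point rule. It is a minimal, non-adaptive illustration under stated assumptions: the nodes and weights are the standard five-point values for the weight function $\exp(-x^2)$, hard-coded to avoid any external dependency, and the function names, data, and parameter values are hypothetical. In practice more nodes, or the adaptive version, would normally be used.

```python
import math

# Five-point Gauss-Hermite rule for integrals of the form  int exp(-x^2) f(x) dx.
GH_NODES = [-2.020182870456086, -0.958572464613819, 0.0,
            0.958572464613819, 2.020182870456086]
GH_WEIGHTS = [0.019953242059046, 0.393619323152241, 0.945308720482942,
              0.393619323152241, 0.019953242059046]


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def cluster_likelihood(y, eta_fixed, sigma):
    """Approximate the integral of prod_j f(y_j | u) against the N(0, sigma^2) density.

    y         : list of 0/1 responses for one cluster
    eta_fixed : fixed part of the linear predictor for each response
    sigma     : standard deviation of the random intercept
    """
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        u = math.sqrt(2.0) * sigma * x      # change of variable u = sqrt(2) * sigma * x
        cond = 1.0
        for y_j, eta in zip(y, eta_fixed):
            p = expit(eta + u)
            cond *= p if y_j == 1 else 1.0 - p
        total += w * cond
    return total / math.sqrt(math.pi)       # normalizing constant left by the substitution


def marginal_loglik(clusters, sigma):
    """Sum the log of the approximated cluster likelihoods."""
    return sum(math.log(cluster_likelihood(y, eta, sigma)) for y, eta in clusters)


clusters = [([1, 1, 0], [0.2, 0.2, 0.2]), ([0, 0], [-0.5, 0.3])]
print(marginal_loglik(clusters, sigma=1.0))
```

A function such as `marginal_loglik` would then be passed to a numerical optimizer over the fixed effects and `sigma`. The cost of a product rule of this kind grows exponentially with the dimension of the random effects, since the nodes must be crossed over every dimension.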

When the dimension of the random effects in model (1.5) is high, the use of
numerical integration techniques such as Gauss-Hermite quadrature becomes computationally
infeasible. A number of alternatives based on Monte Carlo (MC) methods
have been proposed for handling the larger dimensional models. Zeger and Karim
(1991) considered GLMMs in a Bayesian context and proposed an algorithm based
on the Gibbs sampler. The Gibbs sampler is an MC method for generating observations
from a complex joint posterior distribution when sampling from the conditional
distributions is easier. It involves a choice of prior distributions for the fixed parameters
and the components of $\mathbf{Q}$. Though computationally intensive, the algorithm can
accommodate high dimensional integrals and can be easily modified to handle non-Gaussian
random effect distributions. Karim and Zeger (1992) used this approach
for analyzing the infamous salamander data set (McCullagh and Nelder 1989, pp. 439-450),
where the likelihood involved 40-dimensional intractable integrals. Though it
has many attractive features, the use of a Bayesian paradigm with flat or diffuse priors
to approximate the ML estimates may lead to posteriors that do not exist. This
may not be detected when using the Gibbs sampler (Natarajan and McCulloch 1995;
Hobert and Casella 1996), and could lead to incorrect estimates. McCulloch (1994)
proposed a Monte Carlo EM (MCEM) algorithm to fit a probit-binomial model with
normal random effects that used a Gibbs sampler to approximate the E-step. The
Gibbs chain was based on the exact conditional distribution of $\mathbf{u}$ given $\mathbf{y}$. Chan and
Kuk (1997) applied this approach to the salamander data set as well, allowing the
random effects to be correlated.

McCulloch  (1997)  presented  three  algorithms  for  fitting  GLMMs  that  rely  on 
Monte  Carlo  techniques.  The  first  algorithm,  an  MCEM  algorithm,  utilizes  a Metropo- 
lis algorithm  (Tanner  1993)  to  approximate  the  intractable  integrals  in  the  expecta- 
tion step  of  the  EM  algorithm.  In  the  Metropolis  algorithm  one  chooses  both  a 
candidate  distribution  from  which  to  sample  new  values,  as  well  as  an  acceptance 
function  that  gives  the  probability  of  accepting  the  new  values.  McCulloch  (1997) 
showed  that  by  choosing  G( Uj)  as  the  candidate  distribution,  the  acceptance  func- 
tion has  a simple  form  involving  only  the  joint  conditional  distribution  /( y | u).  The 
second  algorithm  again  uses  the  Metropolis  algorithm,  but  within  the  context  of  a 
Newton-Raphson  algorithm.  The  algorithm  iteratively  solves  a score  equation  for  Q 
and  score- type  equation  for  /3.  The  scoring  equation  for  (3  involves  an  intractable 
expectation  where  the  Metropolis  algorithm  is  applied.  The  final  algorithm  simulates 
the  value  of  the  likelihood  directly  as  opposed  to  the  previous  algorithms  that  sim- 
ulate the  log-likelihood.  Using  a simulation  study,  McCulloch  (1997)  concluded  that 
both  the  MCEM  and  the  MC  Newton-Raphson  algorithms  were  feasible  methods  for 
calculating  ML  estimates  in  GLMMs.  The  simulated  maximum  likelihood  (SML) 


12 


approach  performed  poorly  as  a stand-alone  algorithm.  However,  by  running  SML  at 
the  final  estimates  of  the  MCEM  or  MC  Newton-Raphson  algorithms,  convergence 
issues  can  be  addressed,  slightly  more  precise  estimates  can  be  obtained,  and  an 
estimate  of  the  maximized  likelihood  is  available  (McCulloch  1997). 
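
The Metropolis step described above can be illustrated with a minimal sketch. It is not McCulloch's implementation: the single-cluster random-intercept logistic model, the function names, and the data values are all assumptions chosen for illustration. Because candidates are drawn from the random effects distribution itself, the acceptance probability reduces to a ratio of conditional likelihoods $f(y \mid u^*)/f(y \mid u)$.

```python
import math
import random

def cond_lik(y, eta, u):
    """f(y | u) for binary responses with a logit link and random intercept u."""
    lik = 1.0
    for y_j, eta_j in zip(y, eta):
        p = 1.0 / (1.0 + math.exp(-(eta_j + u)))
        lik *= p if y_j == 1 else 1.0 - p
    return lik

def acceptance_prob(y, eta, u_current, u_candidate):
    """Acceptance probability when candidates are drawn from G(u)."""
    return min(1.0, cond_lik(y, eta, u_candidate) / cond_lik(y, eta, u_current))

def metropolis_chain(y, eta, sigma, n_draws, seed=0):
    """Generate a dependent sample from the distribution of u given y."""
    rng = random.Random(seed)
    u = 0.0
    draws = []
    for _ in range(n_draws):
        candidate = rng.gauss(0.0, sigma)  # candidate from G(u) = N(0, sigma^2), assumed
        if rng.random() < acceptance_prob(y, eta, u, candidate):
            u = candidate
        draws.append(u)
    return draws

# Hypothetical cluster: four binary responses with fixed-effects predictor 0.
y = [1, 1, 1, 0]
eta = [0.0, 0.0, 0.0, 0.0]
draws = metropolis_chain(y, eta, sigma=1.0, n_draws=5000)
posterior_mean = sum(draws) / len(draws)
```

Averages over `draws` would play the role of the Monte Carlo approximation to the E-step expectations. Successive draws are dependent, which is the difficulty taken up in the next paragraph.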

A deficiency in the previous approaches based on the Gibbs and Metropolis algorithms
is that the generated samples are dependent and thus MC error is difficult to
assess. Without assessment of the MC error, one cannot determine the appropriate
number  of  simulations  to  use  at  each  iteration  of  the  algorithm.  Booth  and  Hobert 
(1999)  proposed  an  automated  MCEM  algorithm  which  utilized  random  sampling  to 
construct  the  MC  approximations  at  each  E-step.  With  random  sampling  they  were 
able  to  use  standard  central  limit  theory  with  Taylor  series  methods  to  assess  the  MC 
error  at  each  iteration.  The  MC  error  was  then  used  to  determine  the  MC  sample 
size,  creating  an  automated  algorithm.  We  will  look  at  this  algorithm  in  detail  in 
Chapter  3. 
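
The role of independent sampling in assessing MC error can be sketched as follows. This is only an illustration of the central limit argument, not the Booth and Hobert algorithm: the stopping rule (grow the sample by half until the standard error falls below a tolerance), the function names, and the integrand are assumptions.

```python
import math
import random

def mc_mean_and_se(h, sampler, m, rng):
    """MC estimate of E[h(u)] from m independent draws, with its CLT standard error."""
    values = [h(sampler(rng)) for _ in range(m)]
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / (m - 1)
    return mean, math.sqrt(var / m)

def automated_estimate(h, sampler, tol, m_start=100, seed=1):
    """Increase the MC sample size until the estimated MC error is below tol."""
    rng = random.Random(seed)
    m = m_start
    while True:
        mean, se = mc_mean_and_se(h, sampler, m, rng)
        if se < tol:
            return mean, se, m
        m += m // 2  # assumed growth rule: enlarge the sample by half

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical target: the marginal success probability E[expit(0.5 + u)], u ~ N(0, 1).
mean, se, m = automated_estimate(lambda u: expit(0.5 + u),
                                 lambda rng: rng.gauss(0.0, 1.0),
                                 tol=0.005)
```

With dependent draws from a Gibbs or Metropolis chain, the simple variance formula in `mc_mean_and_se` would no longer be valid, which is why the sample size is hard to automate in that setting.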

Besides exact maximum likelihood analysis, there have been several proposed
methods for carrying out approximate inference in GLMMs. Breslow and Clayton
(1993) proposed a penalized quasi-likelihood (PQL) approach, Wolfinger and
O’Connell (1993) utilized a pseudo-likelihood approach, while Engel and Keen (1994)
used a combination of quasi-likelihood and REML. Though each motivated their
methods using different approaches, they all yield equivalent estimates in certain
cases, so we briefly describe the pseudo-likelihood approach. We first express model
(1.5) in terms of the complete data

$$\eta = g(\mu) = Z\beta + Wu. \tag{1.7}$$

Wolfinger and O’Connell (1993) showed that the pseudo observation vector

$$\nu = g(\hat\mu) + g'(\hat\mu)(y - \hat\mu) \approx g(y), \tag{1.8}$$

which is a Taylor series approximation to the linked response function $g(y)$, has
distribution

$$\nu \mid \beta, u \sim N\left(Z\beta + Wu,\; g'(\hat\mu)\, R_\mu^{1/2} R\, R_\mu^{1/2}\, g'(\hat\mu)\right). \tag{1.9}$$

Here $R_\mu$ is a known diagonal covariance matrix for the GLM under consideration, and
$R$ is an additional covariance matrix for modeling PA effects. In the PQL approach,
$R$ is the identity matrix. Considering $\beta$ as unknown parameters and assuming $G(u_i)$
is multivariate normal as before, (1.9) takes the form of a weighted LMM with
diagonal weight matrix

$$H = R_\mu^{-1}\,[g'(\hat\mu)]^{-2}.$$

If we write $V = H^{-1/2} R H^{-1/2} + W Q W'$ then the solutions for $\beta$ and $u$ take the
same form as (1.3) and (1.4). REML or ML estimation can be used to estimate the
elements  of  Q.  There  are  a number  of  advantages  to  using  the  approximate  methods 
for  fitting  GLMMs.  By  using  approximations  they  avoid  the  intractable  integrals 
that  plague  the  exact  maximum  likelihood  approaches.  As  a consequence  the  fitting 
of  these  models  is  relatively  simple,  and  as  seen  above,  existing  methods  for  LMMs 
can  be  used.  Unfortunately  it  has  been  shown  that  these  methods  perform  poorly 
when  the  response  is  far  from  normal,  as  in  binomial  data  with  small  sample  sizes 
(Breslow  and  Clayton  1993;  Breslow  and  Lin  1995;  Lin  and  Breslow  1996). 
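
As a small numerical illustration of the pseudo observation, the sketch below evaluates the first-order Taylor expansion of the link about a current fitted mean for the logit link. The function names and the numbers are hypothetical, and the working weight is written under the assumption that it equals $[g'(\hat\mu)]^{-2}$ divided by the binomial variance of a sample proportion.

```python
import math

def logit(mu):
    return math.log(mu / (1.0 - mu))

def logit_derivative(mu):
    """g'(mu) for the logit link."""
    return 1.0 / (mu * (1.0 - mu))

def pseudo_observation(y, mu_hat):
    """First-order Taylor expansion of g(y) about the current fitted mean mu_hat."""
    return logit(mu_hat) + logit_derivative(mu_hat) * (y - mu_hat)

def working_weight(mu_hat, n):
    """Assumed weight: 1 / (binomial variance of a proportion * g'(mu)^2)."""
    variance = mu_hat * (1.0 - mu_hat) / n
    return 1.0 / (variance * logit_derivative(mu_hat) ** 2)

# Hypothetical observed proportion 0.6 from n = 20 trials, current fitted mean 0.5.
v = pseudo_observation(0.6, 0.5)
w = working_weight(0.5, 20)
```

Here `v` is close to `logit(0.6)`, as the approximation intends, and the pair `(v, w)` is what a weighted linear mixed model routine would receive at each iteration. The approximation degrades when the observed proportion is far from the fitted mean, which is consistent with the poor performance reported for sparse binomial data.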

The  GLMMs  discussed  so  far  have  assumed  that  the  distribution  of  the  random 
effects  was  normal  and  thus  (1.6)  involved  intractable  integrals.  For  certain  models, 
however, the random effects distribution $G(u_i)$ can be chosen such that (1.6) has a closed form.
A number  of  such  models  (e.g.,  beta-binomial  and  Poisson-gamma)  are  found  in 
the  literature  on  overdispersion  (see,  e.g.,  Hinde  and  Demetrio  1998).  Typically  these 
models  are  not  well  suited  for  full  random  effects  modeling  as  inclusion  of  between  and 
within covariates is often difficult. Conaway (1990) proposed a random effects model
for binary data that utilized the log-log link: $g(\mu_i) = \log(-\log(\mu_i))$. By assuming
that the random effects distribution was log-gamma, the resulting marginal likelihood
was shown to have a closed form. Conaway (1990) noted that the log-gamma distribution
was flexible enough to resemble a variety of distributions including a normal
distribution.
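
The beta-binomial model mentioned above gives a convenient check of what a closed-form marginal likelihood means in practice. The sketch below, with hypothetical parameter values, compares the closed-form beta-binomial probability with a direct numerical integration of the binomial likelihood against a beta mixing density.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(y, n, a, b):
    """Closed-form marginal probability: binomial mixed over a Beta(a, b) distribution."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_choose + log_beta(y + a, n - y + b) - log_beta(a, b))

def numeric_marginal(y, n, a, b, grid=20000):
    """The same marginal probability by midpoint-rule integration over p in (0, 1)."""
    total = 0.0
    for k in range(grid):
        p = (k + 0.5) / grid
        binom = math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)
        dens = math.exp((a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p)
                        - log_beta(a, b))
        total += binom * dens / grid
    return total

closed = beta_binomial_pmf(4, 10, 2.0, 3.0)
numeric = numeric_marginal(4, 10, 2.0, 3.0)
```

The two values agree to numerical precision, but only the closed form avoids integration at every likelihood evaluation.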

There has been some evidence to suggest that changes in the random effects distribution
can lead to changes in the fixed effects parameter estimates (Heckman and
Singer 1984; Davies 1987). Neuhaus et al. (1992) showed that estimators of the model
parameters were indeed inconsistent if the random effects distribution was misspecified,
though the magnitude of the bias was typically small. There has been a moderate amount
of recent work focused on nonparametric approaches to fitting random effects models
(Davies 1987; Wood and Hinde 1987; Follmann and Lambert 1989; Aitkin 1996;
Aitkin 1999). Such approaches assume that $G(u_i)$ is a discrete distribution with unknown
masses, mass points, and support size. Thus the integrals in (1.6) are replaced
with  a finite  sum  over  an  unknown  support  size.  By  maximizing  (1.6)  one  obtains 
nonparametric  maximum  likelihood  (NPML)  estimates  for  the  regression  parameters 
as  well  as  the  discrete  mixing  distribution.  In  Chapter  4 we  will  consider  an  NPML 
approach  for  modeling  nominal  and  ordinal  response  data. 
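
To make the finite-sum form of the likelihood concrete, the following sketch evaluates the marginal log-likelihood of clustered binary data under a discrete mixing distribution. The random-intercept logistic model, the data, and the mass points are hypothetical; maximizing over the masses and mass points (for example by EM) is not shown.

```python
import math

def cond_lik(y, x, beta, u):
    """f(y_i | u) for binary responses with logit link, slope beta and intercept u."""
    lik = 1.0
    for y_j, x_j in zip(y, x):
        p = 1.0 / (1.0 + math.exp(-(u + beta * x_j)))
        lik *= p if y_j == 1 else 1.0 - p
    return lik

def npml_loglik(clusters, beta, mass_points, masses):
    """Marginal log-likelihood with the integral replaced by a finite sum."""
    total = 0.0
    for y, x in clusters:
        mixture = sum(m * cond_lik(y, x, beta, u)
                      for u, m in zip(mass_points, masses))
        total += math.log(mixture)
    return total

# Three hypothetical clusters, each given as (responses, covariate values).
clusters = [([1, 1, 0], [0.0, 1.0, 2.0]),
            ([0, 0, 0], [0.0, 1.0, 2.0]),
            ([1, 1, 1], [0.0, 1.0, 2.0])]
one_point = npml_loglik(clusters, 0.2, [0.0], [1.0])
two_point = npml_loglik(clusters, 0.2, [-1.5, 1.5], [0.5, 0.5])
```

With a single mass point the expression is the ordinary fixed-effects log-likelihood. For these heterogeneous clusters the two-point mixture attains a higher value, which is the kind of improvement the NPML estimate searches for.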

The  development  of  random  effects  models  for  nominal  and  ordinal  data  has  lagged 
behind  that  for  binomial  data.  Most  of  the  proposed  models  have  used  either  the 
cumulative logit or cumulative probit links for analyzing ordinal data. Harville and
Mee (1984) proposed a cumulative probit random effects model and used an EM-type
algorithm for obtaining estimates of fixed and random parameters. Using a Bayesian
approach, they assumed a vague prior for the fixed parameters and applied a Taylor
series approximation to evaluate the intractable normal integrals. Following the
approach of Anderson and Aitkin (1985), Jansen (1990) used Gauss-Hermite quadrature
to approximate the E-step in an EM algorithm. He considered the cumulative
probit link and allowed for a single random intercept in a model for analyzing an
agricultural experiment. Ezzet and Whitehead (1991) applied an ordinal random effects
model to the analysis of a crossover experiment. A cumulative logit model was
fit with a single random intercept that was assumed to be normally distributed. The
Newton-Raphson method along with numerical integration was used to maximize
the likelihood.

Two  recent  papers  by  Hedeker  and  Gibbons  (1994)  and  Tutz  and  Hennevogl  (1996) 
presented  general  approaches  for  fitting  GLMMs  for  ordinal  response  data,  each  using 
quite  different  estimation  algorithms.  Hedeker  and  Gibbons  (1994)  considered  both 
cumulative  probit  and  cumulative  logit  models  and  based  their  estimation  methods 
on  directly  maximizing  the  marginal  likelihood.  To  evaluate  the  intractable  marginal 
likelihood,  they  approximated  the  normal  integrals  using  multivariate  Gauss-Hermite 
quadrature.  Fisher’s  scoring  method  was  used  to  obtain  estimates  of  the  parameters 
and  the  inverse  of  the  expected  information  matrix  was  calculated  at  convergence 
to  obtain  standard  errors.  Tutz  and  Hennevogl  (1996)  motivated  their  model  as 
a generalization  of  a multivariate  GLM.  Considering  a cumulative  logit  model,  they 
proposed  three  estimation  procedures,  each  based  on  the  EM  algorithm.  In  two  of  the 
algorithms  MC  methods  were  used  to  approximate  the  integrals  in  the  E-step,  while 
in the third multivariate Gauss-Hermite quadrature was applied. Besides allowing
for random threshold models in which all thresholds are shifted by the same random
effect, Tutz and Hennevogl (1996) also proposed a model in which each threshold
was allowed to vary according to its own random effect. They allowed the separate
random effects to be either independent or correlated. We will consider this extended
model in more detail in Chapter 3, as well as the EM algorithms proposed by Tutz
and Hennevogl (1996). Based on a motivation similar to that in Tutz and Hennevogl
(1996), Fahrmeir and Tutz (1994) proposed a nominal random effects regression model
which utilized the baseline-category logit link. This model will be considered as well
in Chapter 3.

As  an  extension  of  their  previous  methods  for  approximate  inference  in  GLMMs 
(Engel and Keen 1994), Keen and Engel (1997) proposed an iteratively re-weighted
REML  estimation  routine  for  mixed  ordinal  regression  models.  Their  method  was 
essentially  the  PQL  approach  of  Breslow  and  Clayton  (1993)  for  ordinal  data  using 
a cumulative  logit  link.  We  consider  approximate  inference  methods  for  nominal 
and  ordinal  data  in  Chapter  3 based  on  generalizing  the  methods  of  Wolfinger  and 
O’Connell  (1993). 

Two  ordinal  models  that  did  not  use  the  cumulative  logit  or  probit  links  were 
proposed  by  Ten  Have  and  Uttal  (1994)  and  Ten  Have  (1996).  Ten  Have  and  Uttal 
(1994) proposed both PA and SS continuation-ratio logit models for analyzing multiple
discrete time survival profiles of subjects in a psychological study. For a given
set of multinomial probabilities $\pi_1, \ldots, \pi_R$ such that $\sum_{r=1}^{R} \pi_r = 1$, the
continuation-ratio logits are given as

$$\log\left(\frac{\pi_r}{\pi_{r+1} + \cdots + \pi_R}\right), \qquad r = 1, \ldots, R-1.$$

Ten Have and Uttal (1994) used a Bayesian approach assuming a non-informative
prior on the regression parameters and a multivariate normal distribution for the
random effects. They applied the Gibbs sampling approach of Zeger and Karim (1991)
to estimate the model parameters. Ten Have (1996) extended the binary random
effects model of Conaway (1990) to accommodate ordinal data. Here the cumulative
complementary log-log link along with a log-gamma random effects distribution was
used, which resulted in a closed form marginal likelihood.

Conditional maximum likelihood estimation


We now briefly describe the conditional maximum likelihood (CML) approach
and mention a few applications. See, for example, Andersen (1980), Collett (1991),
or Diggle et al. (1994) for further discussion. In CML estimation, one treats the
random effects as nuisance parameters and conditions on their sufficient statistics.
Estimation proceeds by maximizing the conditional likelihood. CML estimation is
appropriate when one is interested in within-subject comparisons. This approach is
frequently used in matched case-control studies and in educational testing studies. An
advantage of this approach over the ML approach is that one does not need to assume
a distribution for the random effects. A consequence of conditioning, however, is that
no information about the variability between subjects is obtained. Also, construction
of sufficient statistics is limited to canonical link models, such as the logit model for
binary data.

An example of the CML approach can be shown with item-response models. Such
models arise in educational testing where a set of $n$ subjects are administered a series
of $T$ questions (items) which have either a correct or incorrect answer. The Rasch
model (Rasch 1961) assumes that the probability $\pi_{ij}$ of a correct answer for subject
$i$ and question $j$ can be modeled by

$$\operatorname{logit}(\pi_{ij}) = \alpha_i + \beta_j. \tag{1.10}$$

Since there is a parameter $\alpha_i$ for each subject, the number of parameters increases as
the sample size increases. In fact, ordinary ML estimators of $\beta_j$ are inconsistent for
model (1.10) (Andersen 1980, p. 244). One could assume a distribution for the $\alpha_i$
and proceed by methods of the previous section to obtain marginal ML estimates of
the $\beta_j$. Alternatively one can apply the CML approach by finding sufficient statistics
for the $\alpha_i$ and then maximizing the likelihood conditional on the sufficient statistics. As
was originally shown by Tjur (1982), the CML estimates for model (1.10) correspond
to standard ML estimates for certain log-linear models. It was later noted that
the CML estimates of $\beta_j$ were in fact the main effect ML estimates from the quasi-symmetry
log-linear model (see, e.g., Fienberg 1986). Methods for fitting such models
for nominal and ordinal responses have been discussed in Agresti (1993a, 1993b) and
Agresti and Lang (1993).
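
The effect of conditioning on a sufficient statistic can be checked numerically. The sketch below assumes a Rasch-type model of the form logit(pi_ij) = alpha_i + beta_j (the sign convention and the item parameters are assumptions for illustration) and shows that, given a subject's total score, the probability of a response pattern does not depend on the subject parameter.

```python
import math
from itertools import product

def pattern_prob(pattern, alpha, beta):
    """Unconditional probability of a binary response pattern for one subject."""
    prob = 1.0
    for y, b in zip(pattern, beta):
        p = 1.0 / (1.0 + math.exp(-(alpha + b)))
        prob *= p if y == 1 else 1.0 - p
    return prob

def conditional_prob(pattern, alpha, beta):
    """Probability of the pattern given the subject's total score."""
    score = sum(pattern)
    same_score = [q for q in product([0, 1], repeat=len(beta)) if sum(q) == score]
    denom = sum(pattern_prob(q, alpha, beta) for q in same_score)
    return pattern_prob(pattern, alpha, beta) / denom

beta = [0.5, -0.3, 1.0]                            # hypothetical item parameters
c_low = conditional_prob((1, 0, 0), -2.0, beta)    # subject with a low alpha
c_high = conditional_prob((1, 0, 0), 3.0, beta)    # subject with a high alpha
```

Both conditional probabilities equal exp(beta_1) divided by the sum of exp(beta_j) over the items, so the subject parameter has been eliminated and only the item parameters remain to be estimated.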

1.3.2  Population-Averaged  Models 

For  longitudinal  or  cluster  data,  models  that  study  the  averaged  response  over 
all subjects or clusters are called PA or marginal models. In these models, the relationship
between the response and the explanatory variables is modeled separately
from the association among repeated observations for an individual. Two of the landmark
papers for PA modeling for GLMs are Zeger and Liang (1986) and Liang and
Zeger  (1986).  To  specify  a marginal  model  one  must  specify  the  marginal  mean, 
the  marginal  variance,  and  the  covariance  function.  As  an  example  we  consider  the 
marginal  specification  of  a binary  response  model. 

As in the definition of a GLM, we specify the marginal means as

$$\pi_{ij} = P(y_{ij} = 1 \mid z_{ij}) = h(z_{ij}'\beta). \tag{1.11}$$

For the binary response model, the marginal variance depends on the marginal mean
by

$$\operatorname{var}(y_{ij} \mid z_{ij}) = \pi_{ij}(1 - \pi_{ij}). \tag{1.12}$$

In marginal models, one also must define a covariance function for modeling the
covariance between $y_{ij}$ and $y_{ij'}$:

$$\operatorname{cov}(y_{ij}, y_{ij'}) = c(\pi_{ij}, \pi_{ij'}, \alpha). \tag{1.13}$$

The covariance function $c(\pi_{ij}, \pi_{ij'}, \alpha)$ depends on the marginal means and possibly
additional association parameters $\alpha$. A special feature of marginal models is that the
parameters $\beta$ can be consistently estimated even if $c(\pi_{ij}, \pi_{ij'}, \alpha)$ is misspecified (Zeger
et al. 1988). Because of this, $\operatorname{cov}(y) = \Sigma(\beta, \alpha)$, which denotes the combination
of (1.12) and (1.13) for the complete data, is treated as a working covariance matrix.




A number of ways have been proposed for specifying the working covariance matrix.
Liang and Zeger (1986) define $\Sigma(\beta, \alpha)$ in terms of a working correlation matrix
$R_\alpha$ such that

$$\Sigma(\beta, \alpha) = R_\pi^{1/2} R_\alpha R_\pi^{1/2},$$

where $R_\pi$ is $\operatorname{diag}[\pi_{ij}(1 - \pi_{ij})]$. If $R_\alpha = I$ then all observations are treated as
independent. One can parameterize $R_\alpha$ to allow for, for example, autoregressive
(AR) correlations, banded correlations, or entirely unstructured correlations. Lipsitz
et al. (1991) proposed specifying the working covariance matrix in terms of odds
ratios. One advantage of this approach is that the odds ratios are not constrained by
the means $\pi$ as in the correlation specification.
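
As an illustration of the correlation-based specification, the sketch below builds a working covariance matrix for one subject from binary marginal means and a first-order autoregressive working correlation. The function names, the AR(1) form rho^|j-k|, and the numerical values are assumptions.

```python
import math

def ar1_correlation(n_obs, rho):
    """AR(1) working correlation matrix with entries rho^|j - k|."""
    return [[rho ** abs(j - k) for k in range(n_obs)] for j in range(n_obs)]

def working_covariance(pi, corr):
    """Scale the working correlation by the binary standard deviations."""
    sd = [math.sqrt(p * (1.0 - p)) for p in pi]
    n_obs = len(pi)
    return [[sd[j] * corr[j][k] * sd[k] for k in range(n_obs)] for j in range(n_obs)]

pi = [0.2, 0.5, 0.7]   # hypothetical marginal means for one subject
sigma_ar1 = working_covariance(pi, ar1_correlation(3, 0.4))
sigma_ind = working_covariance(pi, ar1_correlation(3, 0.0))
```

The diagonal entries reproduce the binomial variances, and setting the correlation parameter to zero recovers the independence working model. For binary data not every correlation value is compatible with a given pair of means, which is the constraint that the odds ratio specification avoids.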

Marginal model estimation is fundamentally different from SS model estimation
since specification of (1.11), (1.12), and (1.13) does not, except for Gaussian data,
fully define a likelihood. Estimation in marginal models is based on generalized estimating
equations (GEEs) which are multivariate analogues of quasi-score functions
(Wedderburn 1974). Briefly, in quasi-likelihood models the assumption of an exponential
family is dropped from the GLM definition, and the model is defined only by
assumptions on the first and second moments. Under appropriate conditions, parameters
can be estimated consistently and asymptotic inference is still possible. The
score functions for such models take the same form as those based on GLMs but
are not likelihood equations because they lack a distributional assumption. Hence
they are called quasi-likelihood or quasi-score functions. In matrix form for binary
response data, the GEE for $\beta$ is

$$S_\beta(\beta, \alpha) = \sum_{i=1}^{n} Z_i' D_i \Sigma_i^{-1}(\beta, \alpha)(y_i - \pi_i) = 0, \tag{1.14}$$

where $Z_i$ is the design matrix for subject $i$, $D_i = \operatorname{diag}[\partial h(z_{ij}'\beta)/\partial \eta_{ij}]$, and
$\Sigma_i(\beta, \alpha)$ is the block of the working covariance matrix for subject $i$. For fixed
$\alpha$, estimates of $\beta$ can be obtained from (1.14) by using Fisher’s scoring.




Several ways have been suggested for estimating the unknown association parameters
$\alpha$. Liang and Zeger (1986) proposed using estimates based on Pearson residuals.
Other approaches suggest defining an estimating equation for $\alpha$ (see, e.g., Prentice
1988). The GEE approach has also been extended to the cases of repeated nominal
and ordinal data (see, e.g., Lipsitz et al. 1994). Other marginal models that are in the
same spirit as the GEE approach are the marginal quasi-likelihood (MQL) approach
of Breslow and Clayton (1993), and the pseudo-likelihood approach of Wolfinger and
O’Connell (1993), which contains a PA covariance matrix for modeling PA correlations.

We have stressed the differences in interpretation between the PA and SS approaches,
which arise from modeling the marginal mean in the former and the conditional
mean, conditioned on the random effects, in the latter. The differences between
PA and SS approaches have been blurred with a recent model proposed by Heagerty
(1999). He proposed a marginalized latent variable model in which the marginal
mean, rather than the conditional mean given the random effects, is modeled as a
function of covariates. He presented his model in the context of a multi-level
model, which we outline here for the two-level case. The model has two components,
the first of which defines the regression structure for the marginal mean $\mu^{M}_{ij}$,

$$g(\mu^{M}_{ij}) = z_{ij}'\beta^{M}, \tag{1.15}$$

while the second describes the dependence among measurements within a cluster
through the conditional mean $\mu^{C}_{ij}$:

$$g(\mu^{C}_{ij}) = \Delta(z_{ij}) + u_i. \tag{1.16}$$

It is also assumed that the elements of the response vector $y_i$ are conditionally independent
given $u_i$ and that the distribution of $u_i$ is completely specified by the parameter
$\alpha$. The $M$ in $\beta^{M}$ is used to denote that the parameters are marginally
defined. The parameter $\Delta(z_{ij})$ in (1.16) is a function of both the linear predictor
$\eta_{ij} = z_{ij}'\beta^{M}$ and the random effects distribution $F_\alpha(u_i)$. As an example, assume
that $u_i \sim N[0, \sigma^2(z_{ij})]$ and re-write $u_i = \sigma(z_{ij}) v_i$ where $v_i \sim N(0, 1)$. The parameter
$\Delta(z_{ij})$ is then defined as the solution to the integral equation that links the marginal
and conditional means:

$$\mu^{M}_{ij} = E(\mu^{C}_{ij}), \tag{1.17}$$

$$h(\eta_{ij}) = \int h[\Delta(z_{ij}) + \sigma(z_{ij}) v_i]\, \phi(v_i)\, dv_i, \tag{1.18}$$

where $\phi$ is the standard normal density function. Heagerty (1999) detailed an algorithm
for fitting the marginalized latent variable model, which involves numerically
evaluating the convolution equation (1.18) to solve for $\Delta(z_{ij})$. Once obtained, existing
maximum likelihood algorithms for GLMMs can be used to fit the marginalized
models.
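
The step of solving the integral equation for the conditional intercept can be sketched numerically. The code below is not Heagerty's algorithm: it assumes a logit link, approximates the normal integral with a trapezoid rule on a fixed grid, and finds the solution by bisection. All function names and values are hypothetical.

```python
import math

def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def marginal_mean(delta, sigma, n_grid=2001, limit=8.0):
    """Trapezoid approximation to the integral of expit(delta + sigma*v) phi(v) dv."""
    step = 2.0 * limit / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        v = -limit + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * expit(delta + sigma * v) * normal_pdf(v)
    return total * step

def solve_delta(eta, sigma, lo=-50.0, hi=50.0, n_iter=100):
    """Bisection for the delta whose implied marginal mean equals expit(eta)."""
    target = expit(eta)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if marginal_mean(mid, sigma) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = solve_delta(1.0, 2.0)   # marginal linear predictor 1.0, random effect SD 2.0
```

With no random effect variance the solution coincides with the marginal linear predictor. With positive variance it is larger in magnitude, reflecting the usual attenuation of marginal effects relative to conditional ones under a logit link.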

As noted before, $\beta^{M}$ in (1.15) has a PA interpretation. From (1.18), $\sigma(z_{ij})$ can
be interpreted as a coefficient of a standardized omitted covariate $v_i$, with $\sigma(z_{ij})$
contrasting subjects with equal $\Delta(z_{ij})$ whose $v_i$ differ by one unit (Heagerty 1999).
It is also possible to calculate subject level effects based on the implied conditional
linear predictor $\Delta(z_{ij})$. Since the marginalized latent variable models are estimated
by  maximum  likelihood  and  assume  an  underlying  latent  variable,  they  can  be  directly 
compared  with  conditionally  defined  models.  The  marginalized  latent  variable  models 
provide  the  flexibility  and  interpretability  of  random  effects  models  for  introducing 
dependence  while  building  regression  structures  for  the  marginal  mean.  Heagerty 
(1999)  argued  that  marginal  mean  modeling  allows  for  valid  application  with  both 
time-dependent  and  time-independent  covariates. 

1.4  Outline 

The models we will consider in Chapters 3 through 5 can be considered as multivariate
generalized linear mixed models. Thus we will begin by defining multivariate
GLMs in Chapter 2 as well as introducing some notation that we will use throughout
the dissertation. In Chapter 3 we will consider parametric approaches to modeling
multinomial random effects models. We first define the multivariate generalized linear
mixed model and then show how multinomial random effects models can be embedded
within this framework. In this chapter we consider two approaches for modeling
the  multinomial  random  effects  model,  one  based  on  numerical  approximations  of  the 
integrals  and  the  other  based  on  Taylor  series  approximations.  For  the  first  approach 
we  propose  two  algorithms,  a quasi-Newton  adaptive  Gauss-Hermite  algorithm  and 
an  automated  Monte  Carlo  EM  algorithm,  while  for  the  second  we  utilize  a restricted 
pseudo-likelihood  algorithm.  A number  of  applications  will  be  considered  to  illustrate 
the  proposed  models.  We  conclude  Chapter  3 by  examining  the  extended  threshold 
model  of  Tutz  and  Hennevogl  (1996). 

As  an  alternative  to  the  parametric  approaches  given  in  Chapter  3,  we  propose, 
in  Chapter  4,  a nonparametric  approach  for  modeling  multinomial  random  effects 
models.  In  particular,  we  consider  models  that  allow  for  a random  intercept  and 
outline an EM algorithm for fitting such models. An important issue in nonparametric
maximum likelihood methods is the identifiability of the model parameters.
For the proposed multinomial models, we discuss this issue and provide a sufficient
condition for ensuring identifiability. We illustrate the proposed models using one of
the datasets analyzed in Chapter 3. We then conclude Chapter 4 with two simulation
studies. In the first simulation study we compare the nonparametric maximum
likelihood  approach  to  parametric  approaches  in  Chapter  3.  In  the  second  study  we 
examine  the  performance  of  the  Wald  and  likelihood-ratio  tests  for  the  nonparametric 
modeling  approach  as  compared  with  the  equivalent  tests  in  the  parametric  approach. 

In  Chapter  5 we  examine  the  use  of  the  proposed  methods  in  Chapters  3 and  4 
for  analyzing  ordinal  response  data  arising  from  multi-center  clinical  trials.  In  this 
type of data, two treatments are compared with respect to an ordinal response at
multiple centers. If one assumes that the centers represent a random sample from
some population of centers, one can utilize random effects models to account for
heterogeneity among the centers. Typically, however, the number of centers is small
and the assumption of normality for the random effects is questionable. We utilize
simulations to examine the performance of a heterogeneous random effects model that
includes a random center effect and a random center-by-treatment effect. We also
propose  an  adaptive  Gauss-Hermite  quadrature  approximated  test  for  testing  that 
a subset  of  the  covariance  matrix  for  the  random  effects  is  zero.  We  then  conclude 
in  Chapter  6 with  a summary  of  the  dissertation  and  proposals  for  possible  areas  of 
future  research. 


CHAPTER  2 

MULTIVARIATE  GENERALIZED  LINEAR  MODELS 
2.1  Introduction 

The models of the next three chapters will be motivated as extensions of multivariate
generalized linear models. There are a number of advantages to motivating
the models in this manner. First, approaching the models in this general framework
allows for a single, unified notation for all models. Modifications for specific models
involve relatively few changes, such as in the link function and design matrix. Second,
the forms of the score functions, information matrices, and fitting algorithms
are  known  for  (multivariate)  GLMs.  Thus  definitions  of  algorithms  for  the  random 
effects  models  can  be  more  easily  derived  using  pieces  from  the  fixed  effects  GLMs. 
Finally,  extensions  to  other  models  for  nominal  and  ordinal  data  not  discussed  here 
should  be  straightforward  with  the  tools  given  in  the  next  three  chapters. 

We  begin  in  Section  2.2  by  defining  the  multivariate  GLM  and  by  showing  how  the 
multinomial  distribution  can  be  embedded  within  the  multivariate  GLM  framework. 
In  Section  2.3  we  discuss  maximum  likelihood  estimation  in  multivariate  GLMs  for 
the  special  case  of  the  multinomial  distribution.  In  the  final  section  we  apply  the 
general  multivariate  GLM  framework  to  the  specific  models  we  will  be  discussing  in 
the  next  three  chapters.  The  notation  introduced  in  the  next  three  sections  will  be 
utilized  throughout  the  remainder  of  the  dissertation. 

2.2  Definition 

Proceeding as in Fahrmeir and Tutz (1994, Chap. 3), let $y_{ij} = (y_{ij1}, \ldots, y_{ijq})'$ be a
$q$-dimensional response vector with corresponding $p$-dimensional covariate vector
$x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ where $E(y_{ij} \mid x_{ij}) = \mu_{ij}$. In anticipation of the models in the next
three chapters, we include both a subject subscript $i$ and an observation-within-subject
subscript $j$. We also assume that subject $i$ has $T_i$ observations, $i = 1, \ldots, n$.
For  clarity  we  will  use  boldfaced  lowercase  letters  (e.g.,  y)  for  column  vectors  and 
denote  matrices  by  capital  letters. 

As  in  the  univariate  GLM  definition  in  Section  1.2,  the  multivariate  GLM  is 
defined  by  a distributional  assumption  and  a structural  assumption. 

1.  Distributional  Assumption: 

The $y_{ij}$ are assumed independent given the $x_{ij}$ and have a distribution that
belongs to a multivariate exponential family with the following form

$$f(y_{ij} \mid \theta_{ij}, \phi, \omega_{ij}) = \exp\left\{ \frac{[y_{ij}'\theta_{ij} - b(\theta_{ij})]}{\phi}\, \omega_{ij} + c(y_{ij}, \phi, \omega_{ij}) \right\}, \tag{2.1}$$

where $\theta_{ij}$ is the natural parameter, $\phi$ is an additional scale parameter, $b(\cdot)$ and
$c(\cdot)$ are functions determined by the member of the exponential family, and $\omega_{ij}$
is a vector of weights.

2. Structural Assumption:

The linear predictor $\eta_{ij} = Z_{ij}\beta$ is related to the expectation by the vector-valued
response function $h = (h_1, \ldots, h_q)$ such that $\mu_{ij} = h(\eta_{ij})$. $Z_{ij}$ is a
$(q \times p)$ design matrix and $\beta' = (\beta_1, \ldots, \beta_p)$ is a vector of unknown parameters.
Alternatively $g(\mu_{ij}) = \eta_{ij}$ where $g$, the link function, is the inverse of the
response function $h$.

We also define $v(\cdot)$ to be the vector-valued function that relates the natural parameter
$\theta_{ij}$ directly to the linear predictor $\eta_{ij}$ such that

$$\theta_{ij} = v(\eta_{ij}). \tag{2.2}$$

The  form  of  the  design  matrix,  parameter  vector,  and  the  response  and  link  functions 
will  depend  on  the  models  being  fit.  In  Section  2.4  we  will  define  these  items  for  the 
models  that  we  will  examine  in  Chapters  3 through  5. 




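Models for nominal and ordinal data are based on the multinomial distribution.
Let $Y_{ij}^{(s)}$, $s = 1, \ldots, n_{ij}$, represent categorical response variables having possible
values $r = 1, \ldots, q+1$. To express the multinomial distribution in the multivariate
exponential form we first re-express $Y_{ij}^{(s)}$ as a dummy vector $y_{ij}^{(s)} = (y_{ij1}^{(s)}, \ldots, y_{ijq}^{(s)})'$
where

$$y_{ijr}^{(s)} = \begin{cases} 1 & \text{if } Y_{ij}^{(s)} = r, \\ 0 & \text{otherwise,} \end{cases} \qquad r = 1, \ldots, q.$$

Thus for $n_{ij}$ independent repetitions, $y_{ij} = \sum_{s=1}^{n_{ij}} y_{ij}^{(s)}$ is distributed multinomial with
parameters $n_{ij}$ and $\pi_{ij} = (\pi_{ij1}, \ldots, \pi_{ijq})'$. Then, following Fahrmeir and Tutz (1994,
p. 69), the distributional form of $\bar{y}_{ij} = y_{ij}/n_{ij}$ can be written

$$f(\bar{y}_{ij} \mid \theta_{ij}, \phi, \omega_{ij}) = \exp\left\{ \frac{[\bar{y}_{ij}'\theta_{ij} - b(\theta_{ij})]}{\phi}\, \omega_{ij} + c(y_{ij}, \phi, \omega_{ij}) \right\}, \tag{2.3}$$

where the natural parameter $\theta_{ij}$ has components

$$\theta_{ijr} = \log\left( \frac{\pi_{ijr}}{1 - \pi_{ij1} - \cdots - \pi_{ijq}} \right), \qquad r = 1, \ldots, q,$$

$$b(\theta_{ij}) = -\log(1 - \pi_{ij1} - \cdots - \pi_{ijq}),$$

$$c(y_{ij}, \phi, \omega_{ij}) = \log\left( \frac{n_{ij}!}{y_{ij1}! \cdots y_{ijq}!\, (n_{ij} - y_{ij1} - \cdots - y_{ijq})!} \right),$$

and $\omega_{ij} = n_{ij}$. In this framework the expectation is denoted by $\pi_{ij}$.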
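
The mapping between multinomial probabilities and natural parameters can be checked with a short sketch. It assumes the baseline-category form in which each of the first q probabilities is compared with the remaining (last) category; the function names and the probability vector are hypothetical.

```python
import math

def natural_parameters(pi):
    """theta_r = log(pi_r / (1 - pi_1 - ... - pi_q)) for the first q categories."""
    last = 1.0 - sum(pi)
    return [math.log(p / last) for p in pi]

def cumulant(theta):
    """b(theta) = log(1 + sum_r exp(theta_r)), equal to -log(1 - pi_1 - ... - pi_q)."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def probabilities(theta):
    """Inverse map from natural parameters back to the first q probabilities."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / denom for t in theta]

pi = [0.2, 0.3, 0.4]               # q = 3 listed categories; the fourth has probability 0.1
theta = natural_parameters(pi)
recovered = probabilities(theta)
```

The round trip recovers the probabilities, and `cumulant(theta)` agrees with the expression for b written in terms of the probabilities.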


2.3  Maximum  Likelihood  Estimation 

We now outline maximum likelihood estimation for a multivariate GLM based on
the multinomial form (2.3) (Fahrmeir and Tutz 1994, p. 98). In general, the MLE
$\hat\beta$ is calculated by finding the solution of the likelihood or score equations. This
solution is only a local maximum in general, but corresponds to the global maximum
as well when the log-likelihood is concave. The score equations are typically nonlinear
and thus an iterative procedure for finding $\hat\beta$ must be used. Common iterative
procedures for fitting GLMs are Fisher scoring, iteratively re-weighted least squares,
and Newton-Raphson, which we now describe.

We proceed by first calculating the score equations for $\beta$, which require the first
derivative of the log-likelihood. The log-likelihood for (2.3) depends on $\beta$ only through
the kernel

$$l_{ij}(\beta) = \frac{[\bar{y}_{ij}'\theta_{ij} - b(\theta_{ij})]}{\phi}\, \omega_{ij}, \tag{2.4}$$

such that the log-likelihood is

$$l(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{T_i} l_{ij}(\beta). \tag{2.5}$$

In (2.4), $\theta_{ij} = \theta(\pi_{ij})$ is a function of $\pi_{ij}$, while $\pi_{ij} = \pi_{ij}(\beta) = h(Z_{ij}\beta)$ is a function
of $\beta$. Thus the derivative of (2.5) with respect to $\beta$ requires the use of the chain
rule for differentiation of vectors. Noting that $\partial \pi_{ij}/\partial \beta = Z_{ij}' D_{ij}$ where $D_{ij} = \partial h/\partial \eta$
evaluated at $\eta_{ij} = Z_{ij}\beta$, the score function takes the form

$$s(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{T_i} Z_{ij}' D_{ij} R_{ij}^{-1} (\bar{y}_{ij} - \pi_{ij}). \tag{2.6}$$

In general $R_{ij} = \operatorname{cov}(\bar{y}_{ij})$ is the covariance matrix for observation $\bar{y}_{ij}$, which depends
on $\beta$ through $\pi_{ij}$. Note that (2.6) has the same form as the GEE for $\beta$ (1.14), which
was mentioned in Section 1.3.2. For the multinomial distribution, the covariance
matrix for $\bar{y}_{ij}$ has the form $R_{ij} = \frac{1}{n_{ij}}\left(\operatorname{diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}'\right)$.

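To obtain the expected Fisher information matrix, we take the expected value of
$s(\beta)s(\beta)'$ yielding

$$F_E(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{T_i} Z_{ij}' D_{ij} R_{ij}^{-1} D_{ij}' Z_{ij} = \sum_{i=1}^{n} \sum_{j=1}^{T_i} Z_{ij}' H_{ij} Z_{ij}, \tag{2.7}$$

where

$$H_{ij} = D_{ij} R_{ij}^{-1} D_{ij}' = \left[ \frac{\partial g(\pi_{ij})}{\partial \pi'}\, R_{ij}\, \frac{\partial g(\pi_{ij})'}{\partial \pi} \right]^{-1}$$

is called the weight matrix.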

Using  (2.6)  and  (2.7),  the  Fisher  scoring  algorithm  for  estimation  of  (3  is  given  by 


/3  = /3  +Fe{/3  )s(0  ),  fe  = 0, 1,2,  • • ■ . 


(2.8) 


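This is in fact equivalent to iteratively re-weighted least squares. First note that we
can re-write the score function (2.6) in terms of the weight matrix as

\[ s(\beta) = \sum_{i=1}^{n}\sum_{j=1}^{T_i} Z_{ij}' H_{ij} [D_{ij}^{-1}]' (y_{ij} - \pi_{ij}). \tag{2.9} \]

If we then define the “pseudo” observation as

\[ \tilde y_{ij} = Z_{ij}\beta + [D_{ij}^{-1}]' (y_{ij} - \pi_{ij}), \]

we can re-write (2.8), using (2.9) and $\tilde y' = (\tilde y_{11}', \ldots, \tilde y_{nT_n}')$, as

\[ \hat\beta^{(k+1)} = F_E^{-1}(\hat\beta^{(k)})\, Z' H(\hat\beta^{(k)})\, \tilde y(\hat\beta^{(k)})
                   = \left[ Z' H(\hat\beta^{(k)}) Z \right]^{-1} Z' H(\hat\beta^{(k)})\, \tilde y(\hat\beta^{(k)}), \tag{2.10} \]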


which is of the form of a weighted least-squares estimate. By iteratively calculating
(2.10), one can obtain the MLE for $\beta$. Here $Z$ and $H$ denote the design and weight
matrices, respectively, for the entire data, where

\[ Z = \begin{bmatrix} Z_{11} \\ Z_{12} \\ \vdots \\ Z_{nT_n} \end{bmatrix}
   \quad\text{and}\quad
   H = \begin{bmatrix} H_{11} & & & 0 \\ & H_{12} & & \\ & & \ddots & \\ 0 & & & H_{nT_n} \end{bmatrix}. \]

For the remainder of this dissertation, when defining block diagonal matrices we
will use the notation $H = \mathrm{diag}(H_{ij})$ and for stacked matrices or vectors we will use
$Z = [Z_{ij}]$. Thus we could define the complete pseudo observation vector as $\tilde y = [\tilde y_{ij}]$.
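As a rough illustration of (2.8) through (2.10), the following Python sketch runs Fisher scoring in its iteratively re-weighted least-squares form for the simplest special case, a binary logit model ($q = 1$) with an intercept and one covariate. In that case $D = R = \pi(1 - \pi)$, so the weight is $H = \pi(1 - \pi)$. The data, the function names, and the stopping rule are assumptions made for the example and do not come from the text.

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def irls_binary_logit(x, y, n_iter=25, tol=1e-10):
    """Fisher scoring in weighted least-squares form for logit P(y = 1) = a + b * x."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate Z'HZ (2 x 2) and Z'H y_tilde (2 x 1) over observations.
        s00 = s01 = s11 = t0 = t1 = 0.0
        for xi, yi in zip(x, y):
            eta = a + b * xi
            pi = expit(eta)
            d = pi * (1.0 - pi)             # D = dh/d(eta)
            h = d                           # H = D R^{-1} D with R = pi (1 - pi)
            y_tilde = eta + (yi - pi) / d   # pseudo observation
            s00 += h
            s01 += h * xi
            s11 += h * xi * xi
            t0 += h * y_tilde
            t1 += h * xi * y_tilde
        det = s00 * s11 - s01 * s01
        a_new = (s11 * t0 - s01 * t1) / det   # solve the 2 x 2 weighted normal equations
        b_new = (s00 * t1 - s01 * t0) / det
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

def score(x, y, a, b):
    """Score vector of the binary logit model; it should vanish at the MLE."""
    s0 = sum(yi - expit(a + b * xi) for xi, yi in zip(x, y))
    s1 = sum(xi * (yi - expit(a + b * xi)) for xi, yi in zip(x, y))
    return s0, s1

# Hypothetical toy data (not separable, so the MLE exists).
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [0, 0, 1, 0, 1, 1]
a_hat, b_hat = irls_binary_logit(x, y)
```

Evaluating `score(x, y, a_hat, b_hat)` at the returned values gives a quick check that the iteration has reached a root of the score equations.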

The Newton-Raphson algorithm is of the same form as (2.8), but with the expected
information matrix $F_E$ replaced by the observed information matrix $F_{obs}$. The
observed information matrix is defined as the negative of the second derivative of the
log-likelihood with respect to $\beta$. For canonical link functions, where $g(\pi_{ij}) = \theta_{ij}$, the
observed and expected information matrices coincide. In general, the contribution
of the $j$th observation from the $i$th subject to the observed information matrix is,
suppressing dependence on $\beta$,

\[ F_{o,ij} = F_{E,ij} - O_{ij}, \tag{2.11} \]

where

\[ O_{ij} = O_{ij}(\beta) = \sum_{r=1}^{q} Z_{ij}' U_{ijr}(\beta) Z_{ij}\, (y_{ijr} - \pi_{ijr}). \tag{2.12} \]

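The matrix $U_{ijr}$ is the matrix of second derivatives

\[ U_{ijr} = \frac{\partial^2 v_r(\eta_{ij})}{\partial\eta\,\partial\eta'}, \]

where $v(\cdot)$ was defined in (2.2). For models based on the logit link, $U_{ijr}$ has the form

\[ \frac{\partial^2 v_r(\eta_{ij})}{\partial\eta\,\partial\eta'}
   = \frac{1}{h_r}\frac{\partial^2 h_r}{\partial\eta\,\partial\eta'}
   - \frac{1}{h_r^2}\frac{\partial h_r}{\partial\eta}\frac{\partial h_r}{\partial\eta'}
   + \frac{1}{1 - \sum_{l=1}^{q} h_l}\frac{\partial^2 \left[\sum_{l=1}^{q} h_l\right]}{\partial\eta\,\partial\eta'}
   + \frac{1}{\left(1 - \sum_{l=1}^{q} h_l\right)^2}\frac{\partial \left[\sum_{l=1}^{q} h_l\right]}{\partial\eta}\frac{\partial \left[\sum_{l=1}^{q} h_l\right]}{\partial\eta'}, \tag{2.13} \]

where $h_r$ is the $r$th component of $h(\eta_{ij})$. Note that $\partial h_r/\partial\eta$ is the $r$th column of $D_{ij}$. To
evaluate (2.13), one also needs the $q$ by $q$ matrix of second derivatives $\partial^2 h_r/\partial\eta\,\partial\eta'$.
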


2.4  Applications 

Though  we  will  present  the  random  effects  approaches  in  the  next  three  chapters 
in  a general  form,  we  will  examine  a number  of  specific  models.  For  nominal  data  we 
will  consider  the  baseline-category  logit  model,  while  for  ordinal  data  we  will  examine 
the  cumulative  logit,  adjacent-category  logit,  and  the  continuation-ratio  logit  models. 

2.4.1  Baseline-Category  Logit  Model 

Categorical  variables  that  do  not  have  a natural  ordering  for  their  levels  are  called 
nominal  variables.  A statistical  model  that  is  appropriate  for  assessing  the  influence  of 
explanatory  variables  on  a nominal  response  is  one  that  uses  baseline-category  logits 
(Agresti  1990,  p.  307).  We  refer  to  such  a model  as  the  baseline-category  logit  model, 
though  it  could  also  be  called  the  multinomial  logit  model  or  the  polychotomous 
logistic  regression  model. 

A common application of baseline-category logit models is in discrete-choice modeling
(Maddala 1983). Such models often appear in the econometric literature. In
discrete-choice models, subjects are presented with a set of $R$ possible choices.
Explanatory variables in discrete-choice models can be classified as either a characteristic
of the chooser (e.g., gender, race, income) or a characteristic of the choice (e.g., price
of  item,  color  of  item).  We  will  see  below  how  these  two  types  of  covariates  influence 
the  form  of  the  design  matrix. 

For the baseline-category logit model, the response function $h(\eta_{ij})$ has components

\[ h_r(\eta_{ij}) = \frac{\exp(\eta_{ijr})}{1 + \sum_{l=1}^{q} \exp(\eta_{ijl})}, \quad r = 1, \ldots, q. \tag{2.14} \]

Alternatively, the link function $g(\pi_{ij})$ has components

\[ g_r(\pi_{ij}) = \log\frac{\pi_{ijr}}{1 - \sum_{l=1}^{q} \pi_{ijl}}, \quad r = 1, \ldots, q. \tag{2.15} \]

Thus the $q$ logits in the baseline-category logit model are formed by pairing each
response category with a baseline response category, which we take to be the last one
(category $R = q + 1$).
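The response function (2.14) and link function (2.15) are inverses of one another, which can be checked numerically. The sketch below is only an illustration; the function names and the example linear predictor are assumptions made here.

```python
import math

def bcl_response(eta):
    """h(eta): q linear predictors -> probabilities of categories 1..q, as in (2.14)."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [math.exp(e) / denom for e in eta]

def bcl_link(pi):
    """g(pi): probabilities of categories 1..q -> baseline-category logits, as in (2.15)."""
    baseline = 1.0 - sum(pi)  # probability of the baseline category R = q + 1
    return [math.log(p / baseline) for p in pi]

eta = [0.5, -1.0, 2.0]     # hypothetical linear predictor with q = 3
pi = bcl_response(eta)     # probabilities of the first q categories
eta_back = bcl_link(pi)    # recovers eta up to rounding error
```

The probability of the baseline category is never modeled directly; it is obtained as one minus the sum of the other $q$ probabilities.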

In the baseline-category logit model there is a separate parameter vector $\beta_r$ for
each of the $q$ logits. Let $x_{ij}$ be the covariates related to the chooser and let $v_r$ be the
covariates related to the choice. The subscript $r$ refers to the $r$th choice and implies
that covariates for the choice are the same across all subjects and observations. Since
logits are formed by pairing responses to a baseline category, the design matrix
including both types of covariates has the form

\[ Z_{ij} = \begin{bmatrix}
1 & x_{ij}' & & & & & & v_1' - v_R' \\
& & 1 & x_{ij}' & & & & v_2' - v_R' \\
& & & & \ddots & & & \vdots \\
& & & & & 1 & x_{ij}' & v_q' - v_R'
\end{bmatrix}. \tag{2.16} \]

For design matrix (2.16), the parameter vector is

\[ \beta' = (\beta_1', \ldots, \beta_q', \gamma'), \]

where $\beta_r' = (\alpha_r, \beta_r^{*\prime})$ contains a threshold parameter for the $r$th logit and a
covariate parameter vector corresponding to the chooser covariates for the $r$th logit,
$r = 1, \ldots, q$. The parameters $\beta_r$ have log odds interpretations with respect to the
baseline category. The final parameter vector $\gamma$ corresponds to the differences in the
covariate vectors between the paired response categories. The parameter $\gamma$ measures
the influence of the characteristics of the choices and its interpretation is the same
across all logits.
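To make the layout of the design matrix (2.16) concrete, the following sketch builds $Z_{ij}$ for one observation and forms the linear predictor $Z_{ij}\beta$. The helper names and the numerical values are hypothetical; the parameter vector is ordered as $(\alpha_1, \beta_1^*, \ldots, \alpha_q, \beta_q^*, \gamma)$.

```python
def bcl_design_matrix(x, v):
    """Build Z_ij of (2.16).

    x: chooser covariates (length p)
    v: list of R = q + 1 choice-covariate vectors; the last one is the baseline.
    """
    q = len(v) - 1
    block = [1.0] + list(x)          # (1, x') block repeated for each logit
    width = len(block)
    rows = []
    for r in range(q):
        row = [0.0] * (q * width)
        row[r * width:(r + 1) * width] = block                 # chooser block of logit r
        row += [vr - vR for vr, vR in zip(v[r], v[q])]         # v_r - v_R
        rows.append(row)
    return rows

def linear_predictor(Z, beta):
    """eta = Z beta, one entry per logit."""
    return [sum(z * b for z, b in zip(row, beta)) for row in Z]

# Hypothetical example: one chooser covariate, one choice covariate, R = 3 choices.
Z = bcl_design_matrix([2.0], [[1.0], [3.0], [0.5]])
beta = [0.1, 0.2, -0.3, 0.4, 1.0]   # (alpha_1, beta_1*, alpha_2, beta_2*, gamma)
eta = linear_predictor(Z, beta)
```

Each row carries its own copy of the chooser block, so the chooser effects differ by logit, while the last column is shared and yields the single effect $\gamma$.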




The derivative matrix $D_{ij} = \partial h/\partial\eta$ evaluated at $\eta_{ij} = Z_{ij}\beta$ is required for carrying
out maximum likelihood estimation. The $(u, v)$th element in the $q$ by $q$ matrix $D_{ij}$
corresponds to the derivative $\partial h_v/\partial\eta_{iju}$. For the baseline-category logit model, $D_{ij}$ takes
on the familiar form

\[ D_{ij} = \mathrm{diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}'. \]

Since the link function is the same as the natural parameter for the baseline-category
logit model, $O_{ij}$ in (2.12) is zero.
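The closed form $\mathrm{diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}'$ can be compared with a central finite-difference approximation of $\partial h_v/\partial\eta_u$. The sketch below does this for an arbitrary linear predictor; the helper names and the step size are assumptions of the example.

```python
import math

def bcl_response(eta):
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [math.exp(e) / denom for e in eta]

def bcl_derivative(pi):
    """Closed form D = diag(pi) - pi pi'."""
    q = len(pi)
    return [[(pi[u] if u == v else 0.0) - pi[u] * pi[v] for v in range(q)]
            for u in range(q)]

def numeric_derivative(h, eta, eps=1e-6):
    """Central differences; entry (u, v) approximates d h_v / d eta_u."""
    q = len(eta)
    D = [[0.0] * q for _ in range(q)]
    for u in range(q):
        up, down = list(eta), list(eta)
        up[u] += eps
        down[u] -= eps
        h_up, h_down = h(up), h(down)
        for v in range(q):
            D[u][v] = (h_up[v] - h_down[v]) / (2.0 * eps)
    return D

eta = [0.3, -0.7, 1.1]   # hypothetical linear predictor
D_exact = bcl_derivative(bcl_response(eta))
D_num = numeric_derivative(bcl_response, eta)
max_err = max(abs(D_exact[u][v] - D_num[u][v]) for u in range(3) for v in range(3))
```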

2.4.2  Adjacent-Category  Logit  Model 

We  now  consider  models  for  ordinal  data,  the  first  being  the  adjacent-category 
logit  model.  When  analyzing  such  response  variables,  a model  should  be  chosen 
that  will  account  for  the  ordering.  The  models  that  we  will  examine  incorporate 
the  ordering  directly  into  the  link  functions.  The  adjacent-category  logit  model  is 
a special  case  of  a model  originally  considered  by  Andersen  (1973)  and  is  described 
in  detail  in  Agresti  (1990).  As  the  name  implies,  adjacent  categories  of  the  ordinal 
response  are  used  to  form  logits.  When  all  covariates  are  categorical,  the  adjacent- 
category  logit  model  corresponds  to  a log-linear  model  with  scores  assigned  to  the 
ordinal  response.  Unlike  the  baseline-category  logit  model,  a common  association 
parameter $\beta$ is usually assumed to hold for all adjacent-category logits. This is how
the  model  treats  the  response  as  ordinal. 

The  form  of  the  response  and  link  functions  for  the  adjacent-category  logit  model  is 
very similar to (2.14) and (2.15) for the baseline-category logit model. If we let
$\eta_{ijr}^* = (r - R)\eta_{ijr}$, $r = 1, \ldots, q$, then the components of the response and link functions
for the adjacent-category logit model have the same form as those for the
baseline-category logit model. That is,

\[ h_r(\eta_{ij}^*) = \frac{\exp(\eta_{ijr}^*)}{1 + \sum_{l=1}^{q} \exp(\eta_{ijl}^*)}, \quad r = 1, \ldots, q, \tag{2.17} \]

and

\[ g_r(\pi_{ij}^*) = \log\frac{\pi_{ijr}^*}{1 - \sum_{l=1}^{q} \pi_{ijl}^*}, \quad r = 1, \ldots, q, \tag{2.18} \]

where $\pi_{ijr}^*$ is a function of the altered design matrix.

Since the adjacent-category logit model has a single association parameter, the
types of covariates $x_{ij}$ for each pair of logits must be the same. Therefore the design
matrix is given by

\[ Z_{ij} = \begin{bmatrix}
1 & & & x_{ij}' \\
& \ddots & & \vdots \\
& & 1 & x_{ij}'
\end{bmatrix}. \tag{2.19} \]

The parameter vector $\beta' = (\alpha_1, \ldots, \alpha_q, \beta^{*\prime})$ includes $q$ threshold parameters and the
parameters associated with the covariate vector. These parameters have log odds
interpretations that hold across all adjacent pairs of responses. Calculated from the
adjusted design matrix, the derivative matrix $D_{ij} = \mathrm{diag}(\pi_{ij}^*) - \pi_{ij}^*\pi_{ij}^{*\prime}$ and $O_{ij} = 0$.
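The link between adjacent-category logits and baseline-category logits can be sketched numerically. The code below assumes that the $r$th adjacent-category logit is $\log(\pi_r/\pi_{r+1})$, a convention adopted here for illustration, so that $\log(\pi_r/\pi_R)$ is the sum of the adjacent logits from $r$ to $q$. The function name and the parameter values are hypothetical.

```python
import math

def acl_probs(eta):
    """Category probabilities (length R = q + 1) from q adjacent-category logits.

    Assumes eta[r] = log(pi_r / pi_{r+1}) with categories indexed from 0 in the code.
    """
    q = len(eta)
    # log(pi_r / pi_R) is the sum of the adjacent logits from r up to q.
    base_logits = [sum(eta[r:]) for r in range(q)]
    denom = 1.0 + sum(math.exp(b) for b in base_logits)
    probs = [math.exp(b) / denom for b in base_logits]
    probs.append(1.0 / denom)   # baseline category R
    return probs

alpha = [0.2, -0.4, 0.9]        # hypothetical thresholds, q = 3
x_beta = 0.5                    # hypothetical common covariate effect x'beta*
eta = [a + x_beta for a in alpha]
pi = acl_probs(eta)
adjacent = [math.log(pi[r] / pi[r + 1]) for r in range(len(eta))]  # recovers eta
```

Because every adjacent logit shares the same covariate term, the covariate effect on $\log(\pi_r/\pi_R)$ grows linearly with the distance between category $r$ and the baseline.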


2.4.3  Cumulative  Logit  Model 

One  of  the  most  popular  ordinal  response  models  is  the  cumulative  logit  model 
(McCullagh  1980;  Agresti  1990,  sec.  9.4).  The  logits  in  the  cumulative  logit  model 
are  functions  of  cumulative  probabilities,  where  the  rth  logit  is  a logit  for  a binary 
response in which categories 1 to $r$ are paired with categories $r + 1$ to $R$. The
association parameter $\beta$ is usually assumed to be the same across all logits and it is
interpreted as a cumulative log odds. The cumulative log odds ratio for covariates
$x_1$ and $x_2$ then is proportional to the difference between the covariates, which holds
across  all  logits.  Because  of  this  property,  the  model  is  often  referred  to  as  the 
proportional  odds  model.  Since  this  model  uses  groupings  of  categories  rather  than 
individual  categories,  cumulative  logit  models  are  not  equivalent  to  log-linear  models. 

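The components of the response function for the cumulative logit model take the
form

\[ \pi_{ij1} = h_1(\eta_{ij}) = \frac{1}{1 + \exp(-\eta_{ij1})}, \]
\[ \pi_{ijr} = h_r(\eta_{ij}) = \frac{1}{1 + \exp(-\eta_{ijr})} - \frac{1}{1 + \exp(-\eta_{ij,r-1})}, \quad r = 2, \ldots, q, \tag{2.20} \]

and the components of the link function satisfy

\[ g_r(\pi_{ij}) = \log\frac{\sum_{l=1}^{r} \pi_{ijl}}{1 - \sum_{l=1}^{r} \pi_{ijl}}, \quad r = 1, \ldots, q. \tag{2.21} \]

Since the parameter vector $\beta$ is constant across all logits, the design matrix takes the
same form as that for the adjacent-category logit model (2.19):

\[ Z_{ij} = \begin{bmatrix}
1 & & & x_{ij}' \\
& \ddots & & \vdots \\
& & 1 & x_{ij}'
\end{bmatrix}. \tag{2.22} \]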

The parameter vector $\beta' = (\alpha_1, \ldots, \alpha_q, \beta^{*\prime})$ includes $q$ threshold parameters and
the parameters associated with the covariate vector. The thresholds are strictly
ordered such that $\alpha_1 < \cdots < \alpha_q$. The cumulative logit model can
be motivated from a latent variable approach (see Section 3.2), in which the true
underlying response is measured only in relation to where the thresholds lie. For this
reason, these models are also referred to as threshold models.
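The proportional odds property can be seen in a small numerical sketch: with ordered thresholds and a common covariate effect, the difference in cumulative logits between two covariate values is the same for every $r$. The function names, thresholds, and covariate values below are assumptions chosen for illustration.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def cumulative_logit_probs(alpha, x_beta):
    """Probabilities of all R = q + 1 categories from eta_r = alpha_r + x'beta*."""
    cum = [expit(a + x_beta) for a in alpha]                  # P(Y <= r), r = 1..q
    probs = [cum[0]]
    probs += [cum[r] - cum[r - 1] for r in range(1, len(cum))]
    probs.append(1.0 - cum[-1])                               # last category R
    return probs

def cumulative_logits(probs):
    """Link (2.21): logits of the cumulative probabilities."""
    logits, running = [], 0.0
    for p in probs[:-1]:
        running += p
        logits.append(math.log(running / (1.0 - running)))
    return logits

alpha = [-1.0, 0.3, 1.5]   # hypothetical ordered thresholds
beta = 0.8                 # hypothetical scalar covariate effect
probs_x1 = cumulative_logit_probs(alpha, beta * 2.0)
probs_x2 = cumulative_logit_probs(alpha, beta * 0.5)
log_odds_ratios = [a - b for a, b in
                   zip(cumulative_logits(probs_x1), cumulative_logits(probs_x2))]
# Every entry equals beta * (2.0 - 0.5), whatever the cut point r.
```

The ordering of the thresholds is what keeps every category probability positive in this construction.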

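Let $\Pi_{ijr} = \pi_{ij1} + \cdots + \pi_{ijr}$ and let $\Psi_{ijr} = \Pi_{ijr}(1 - \Pi_{ijr})$, $r = 1, \ldots, q$. Then the
derivative matrix $D_{ij}$ for the cumulative logit model takes the form

\[ D_{ij} = \begin{bmatrix}
\Psi_{ij1} & -\Psi_{ij1} & & & 0 \\
& \Psi_{ij2} & -\Psi_{ij2} & & \\
& & \ddots & \ddots & \\
& & & \Psi_{ij,q-1} & -\Psi_{ij,q-1} \\
0 & & & & \Psi_{ijq}
\end{bmatrix}. \tag{2.23} \]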


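For the cumulative logit model, $g(\pi_{ij}) \neq \theta_{ij}$ and so the matrix $O_{ij}$ is not zero. To
calculate the observed information matrix, the second derivative matrix
$\partial^2 h_r/\partial\eta_{ij}\,\partial\eta_{ij}'$ is needed for evaluating (2.12). For $\Pi_{ijr}$ and $\Psi_{ijr}$ defined as above,
$\partial^2 h_r/\partial\eta_{ij}\,\partial\eta_{ij}'$ has $(u, v)$th element

\[ \frac{\partial^2 h_r}{\partial\eta_{iju}\,\partial\eta_{ijv}} =
\begin{cases}
\Psi_{ijr}(1 - 2\Pi_{ijr}) & \text{if } u = v = r, \\
-\Psi_{ij,r-1}(1 - 2\Pi_{ij,r-1}) & \text{if } u = v = r - 1, \\
0 & \text{otherwise,}
\end{cases} \]

where $\Psi_{ij0}$ is defined to be zero.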
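The bidiagonal structure of the derivative matrix for the cumulative logit model, with $\Psi_{ijr}$ on the diagonal and $-\Psi_{ijr}$ immediately to its right, can be checked against finite differences of the response function. The sketch below uses hypothetical values and helper names that are assumptions of the example.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def cl_response(eta):
    """First q category probabilities: differences of cumulative probabilities."""
    cum = [expit(e) for e in eta]
    return [cum[0]] + [cum[r] - cum[r - 1] for r in range(1, len(eta))]

def cl_derivative(eta):
    """Entry (u, v) is d h_v / d eta_u: Psi_u on the diagonal, -Psi_u to its right."""
    q = len(eta)
    psi = [expit(e) * (1.0 - expit(e)) for e in eta]
    D = [[0.0] * q for _ in range(q)]
    for u in range(q):
        D[u][u] = psi[u]
        if u + 1 < q:
            D[u][u + 1] = -psi[u]
    return D

def numeric_derivative(h, eta, eps=1e-6):
    """Central differences; entry (u, v) approximates d h_v / d eta_u."""
    q = len(eta)
    D = [[0.0] * q for _ in range(q)]
    for u in range(q):
        up, down = list(eta), list(eta)
        up[u] += eps
        down[u] -= eps
        h_up, h_down = h(up), h(down)
        for v in range(q):
            D[u][v] = (h_up[v] - h_down[v]) / (2.0 * eps)
    return D

eta = [-0.8, 0.1, 1.4]   # hypothetical ordered linear predictors
D_exact = cl_derivative(eta)
D_num = numeric_derivative(cl_response, eta)
max_err = max(abs(D_exact[u][v] - D_num[u][v]) for u in range(3) for v in range(3))
```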

2.4.4  Continuation-Ratio  Logit  Model 

The  final  model  for  ordinal  data  is  the  continuation-ratio  logit  (CRL)  model  (Cox 
1972,  Agresti  1990,  p.  319).  Given  a set  of  multinomial  response  probabilities,  the 
continuation  ratio  is  defined  as  the  ratio  of  the  rth  multinomial  probability  over  the 
sum  of  the  remaining  r + 1 to  R probabilities.  Applications  of  the  CRL  model  include, 
for  example,  modeling  discrete  time  data  (see,  e.g.,  Ten  Have  and  Uttal  1994)  and 
modeling data in which the ordering of the response is due to a sequential mechanism
(Fahrmeir and Tutz 1994, sec. 3.3.4). An example of response data in which the
ordered  categories  are  defined  sequentially  is  found  in  McCullagh  (1980).  The  data  set 
consisted of information on tonsil sizes of children. Each child was classified according
to relative tonsil size and whether the child carried Streptococcus pyogenes. The tonsil size
was  classified  as  “Present  but  not  enlarged”,  “Enlarged”,  or  “Greatly  enlarged”. 
Children  are  assumed  to  start  in  a normal  state  (category  1).  If  the  tonsils  start  to 
grow  abnormally,  they  could  become  enlarged  (category  2).  If  they  keep  growing, 
they  could  move  to  the  final  group  and  be  very  enlarged.  However,  to  get  to  the 
third  category,  the  tonsil  must  move  through  the  first  two  categories.  Therefore  the 
ordinal  responses  are  sequential  in  nature.  The  CRL  model  can  be  motivated  from 
underlying  latent  variables  based  on  this  sequential  mechanism  which  will  be  outlined 
in  Section  3.2.2. 

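For the CRL model, the components of the response function take the form

\[ h_r(\eta_{ij}) = \frac{1}{1 + \exp(-\eta_{ijr})} \prod_{l=1}^{r-1} \left[ 1 - \frac{1}{1 + \exp(-\eta_{ijl})} \right], \quad r = 1, \ldots, q, \tag{2.24} \]

where $\prod_{l=1}^{0}\{\cdot\} = 1$, and the components of the link function are

\[ g_r(\pi_{ij}) = \log\frac{\pi_{ijr}}{\pi_{ij,r+1} + \cdots + \pi_{ijR}}, \quad r = 1, \ldots, q. \tag{2.25} \]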

If the parameters in each of the logits for the CRL models are distinct, then fitting
the models separately for each logit will yield the same results as fitting all logits
simultaneously (Agresti 1990, p. 319). For ordinal models, however, one can assume
that the association parameter is the same across all logits, so that the parameter vector
has the form $\beta' = (\alpha_1, \ldots, \alpha_q, \beta^{*\prime})$. In contrast to the cumulative logit model, the
thresholds $\alpha_r$ for the CRL model are not ordered. The design matrix for the model with
common effect parameter has the familiar form

\[ Z_{ij} = \begin{bmatrix}
1 & & & x_{ij}' \\
& \ddots & & \vdots \\
& & 1 & x_{ij}'
\end{bmatrix}. \tag{2.26} \]


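The matrix of derivatives $D_{ij}$ is somewhat complicated due to the form of the
response function (2.24). Let $\Gamma_{ijr} = \dfrac{1}{1 + \exp(-\eta_{ijr})}$. The $(u, v)$th element, $d_{uv}$, of
$D_{ij}$ has the form

\[ d_{uv} =
\begin{cases}
0 & \text{if } u > v, \\
\Gamma_{iju} \prod_{l=1}^{u} (1 - \Gamma_{ijl}) & \text{if } u = v, \\
-\Gamma_{iju}\Gamma_{ijv} \prod_{l=1}^{v-1} (1 - \Gamma_{ijl}) & \text{if } u < v,
\end{cases} \tag{2.27} \]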

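where $\prod_{l=1}^{0}\{\cdot\} = 1$. In addition, to calculate the observed information matrix, the
$(u, v)$th element of the matrix of second derivatives $\partial^2 h_r/\partial\eta_{ij}\,\partial\eta_{ij}'$ is given by

\[ \frac{\partial^2 h_r}{\partial\eta_{iju}\,\partial\eta_{ijv}} =
\begin{cases}
0 & \text{if } r < u \text{ or } r < v, \\
\Gamma_{ijr}(1 - 2\Gamma_{ijr}) \prod_{l=1}^{r} (1 - \Gamma_{ijl}) & \text{if } u = v = r, \\
-\Gamma_{ijr}\Gamma_{iju}(1 - 2\Gamma_{iju}) \prod_{l=1}^{r-1} (1 - \Gamma_{ijl}) & \text{if } u = v \neq r, \\
-\Gamma_{ijr}\Gamma_{ijv} \prod_{l=1}^{r} (1 - \Gamma_{ijl}) & \text{if } u = r,\ v \neq r, \\
\Gamma_{ijr}\Gamma_{iju}\Gamma_{ijv} \prod_{l=1}^{r-1} (1 - \Gamma_{ijl}) & \text{if } u \neq v \neq r,
\end{cases} \]

where $\prod_{l=1}^{0}\{\cdot\} = 1$.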
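The sequential structure of the CRL response function (2.24) and the upper-triangular derivative matrix (2.27) can be sketched as follows. The code writes $\Gamma_r = 1/(1 + \exp(-\eta_r))$, builds each probability as $\Gamma_r$ times the probability of having passed all earlier categories, and compares the closed-form derivatives with central finite differences. Function names and numerical values are assumptions of the example.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def crl_response(eta):
    """h_r = Gamma_r * prod_{l < r} (1 - Gamma_l), r = 1..q."""
    probs, survive = [], 1.0          # survive = product of (1 - Gamma_l) so far
    for e in eta:
        gamma = expit(e)
        probs.append(gamma * survive)
        survive *= 1.0 - gamma
    return probs

def crl_derivative(eta):
    """Entry (u, v) is d h_v / d eta_u, following the three cases of (2.27)."""
    q = len(eta)
    gamma = [expit(e) for e in eta]
    prod = [1.0]                      # prod[k] = product of (1 - Gamma_l) over first k terms
    for g in gamma:
        prod.append(prod[-1] * (1.0 - g))
    D = [[0.0] * q for _ in range(q)]                   # zero below the diagonal (u > v)
    for u in range(q):
        D[u][u] = gamma[u] * prod[u + 1]                # case u = v
        for v in range(u + 1, q):
            D[u][v] = -gamma[u] * gamma[v] * prod[v]    # case u < v
    return D

def numeric_derivative(h, eta, eps=1e-6):
    q = len(eta)
    D = [[0.0] * q for _ in range(q)]
    for u in range(q):
        up, down = list(eta), list(eta)
        up[u] += eps
        down[u] -= eps
        h_up, h_down = h(up), h(down)
        for v in range(q):
            D[u][v] = (h_up[v] - h_down[v]) / (2.0 * eps)
    return D

eta = [0.4, -0.6, 1.2]   # hypothetical linear predictors (thresholds need not be ordered)
pi = crl_response(eta)
D_exact = crl_derivative(eta)
D_num = numeric_derivative(crl_response, eta)
max_err = max(abs(D_exact[u][v] - D_num[u][v]) for u in range(3) for v in range(3))
```

The entries below the diagonal vanish because the probability of category $v$ does not involve any linear predictor with index greater than $v$.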


CHAPTER  3 

MULTIVARIATE  GENERALIZED  LINEAR  MIXED  MODELS  FOR  NOMINAL 

AND  ORDINAL  RESPONSE  DATA 

3.1  Introduction 

As  stated  in  Chapter  1,  there  has  been  relatively  little  research  in  the  area  of 
random  effects  models  for  nominal  and  ordinal  response  data.  The  majority  of  this 
work has been focused on models for ordinal responses with cumulative logit or probit
links that have allowed only simple random effects structures (Jansen 1990; Ezzet
and  Whitehead  1991),  or  have  been  based  on  Taylor  series  approximations  (Harville 
and  Mee  1984).  Special  models  for  correlated  discrete  failure  time  data  with  ordinal 
responses  have  also  been  considered  by  Ten  Have  and  Uttal  (1994)  and  Ten  Have 
(1996) which utilized the continuation-ratio and complementary log-log links, respectively.
In the former, estimation was carried out by way of the Gibbs sampling routine
by  first  assuming  a noninformative  prior  distribution  for  the  regression  parameters. 
Only  recently  has  a general  approach  for  modeling  clustered  ordinal  response  data 
been  presented  (Hedeker  and  Gibbons  1994;  Tutz  and  Hennevogl  1996). 

Hedeker  and  Gibbons  (1994)  and  Tutz  and  Hennevogl  (1996)  proposed  similar 
models for repeated ordinal responses, though they considered quite different estimation
routines. Hedeker and Gibbons (1994) considered a general random effects
model  for  ordinal  data  using  either  the  cumulative  logit  or  probit  links.  They  directly 
maximized  the  marginal  likelihood  obtained  by  approximating  the  normal  integrals 
by  Gauss-Hermite  quadrature.  Tutz  and  Hennevogl  (1996)  also  proposed  an  ordinal 
regression  model  that  allowed  for  a general  random  effects  structure.  They  motivated 
their  model  as  a multivariate  generalized  linear  mixed  model  and  considered  Gauss- 
Hermite  quadrature  and  Monte  Carlo  EM  algorithms  to  maximize  the  log-likelihood. 




Tutz  and  Hennevogl  (1996)  also  proposed  an  ordinal  model  in  which  each  threshold 
was assumed to be random. They relaxed the usual assumption in which all thresholds
for a given subject are shifted randomly by the same amount, and allowed each
threshold  to  vary  according  to  its  own  distribution.  Estimation  in  this  general  model 
is  more  difficult  since  the  order  restriction  on  the  thresholds  (see  Section  2.4.3)  may 
be  violated  if  the  variabilities  of  the  thresholds  are  large  or  the  thresholds  are  not 
well  separated. 

Random  effects  models  for  nominal  response  data  have  received  even  less  attention 
in  the  statistical  literature.  Fahrmeir  and  Tutz  (1994,  p.  231)  outlined  a baseline- 
category  logit  model  which  allowed  for  random  thresholds.  Hedeker  (2000)  proposed  a 
similar  baseline-category  model  and  provided  a Fortran  program  that  approximated 
the  normal  integrals  using  Gauss-Hermite  quadrature.  A notable  deficiency  in  his 
proposed  model,  however,  was  the  inability  to  estimate  correlations  between  random 
effects in different thresholds. For example, Hedeker (2000) allowed each threshold
for  a given  subject  to  vary  according  to  its  own  distribution,  but  assumed  that  the 
threshold  random  effects  were  perfectly  correlated.  One  would  expect  that  thresholds 
from  the  same  subject  would  be  correlated,  but  assuming  that  the  correlation  is  always 
1.0  is  an  overly  strong  assumption. 

Some  special  cases  of  the  baseline-category  logit  random  effects  model  have  been 
examined  in  the  psychometric  literature.  In  psychometric  research  a popular  model 
for  analyzing  item  response  data  is  the  Rasch  model  (Rasch  1961).  In  such  models 
it  is  assumed  that  the  items  (questions)  being  measured  describe  some  underlying 
latent  trait,  or  traits,  of  a set  of  cases  (subjects).  Questions  on  standardized  tests,  for 
example,  are  used  to  assess  the  underlying  verbal  or  mathematical  ability  of  students. 
When  the  items  being  measured  have  nominal  responses,  the  baseline-category  logit 
model  can  be  used  to  estimate  the  item  parameters.  Adams  and  Wilson  (1996) 
considered such a model, but allowed the thresholds to be shifted for each subject.
This model was then extended by Adams et al. (1997) to allow the thresholds to
vary  individually  for  each  subject.  For  both  models,  an  EM  algorithm  was  used 
to  maximize  the  marginal  log-likelihood.  Two  approaches  were  used  to  obtain  the 
marginal  log-likelihood.  In  the  first  approach,  the  random  effects  were  assumed  to 
be  multivariate  normal,  and  a grid  of  points  was  chosen  at  which  the  multivariate 
normal  density  was  approximated.  The  points  and  the  approximate  weights  were  then 
used  to  approximate  the  integrals.  In  the  second  approach,  the  random  effects  were 
assumed  to  follow  a discrete  step  distribution  defined  on  a prespecified  set  of  nodes. 
The  density  values  at  the  nodes  of  the  discrete  step  distribution  were  estimated  within 
the  EM  algorithm. 

In  this  chapter  we  propose  general  random  effects  models  for  nominal  and  ordinal 
response  data.  We  motivate  these  models  as  extensions  of  the  multivariate  generalized 
linear  model  considered  in  Section  2.1.  The  resulting  multivariate  generalized  linear 
mixed  model  provides  a unified  framework  from  which  various  models  for  nominal 
and  ordinal  data  can  be  motivated.  In  particular,  we  consider  four  multinomial  logit 
models  that  have  link  functions  based  on  the  logit  link.  For  nominal  data  we  consider 
the  baseline-category  logit  model  and  for  ordinal  data  we  consider  the  continuation- 
ratio  logit  model,  the  adjacent-category  logit  model,  and  the  cumulative  logit  model. 
For  the  baseline-category  logit  model,  we  allow  for  a general  random  effects  structure 
which  includes  correlated  random  effects  between  thresholds,  in  contrast  to  Hedeker 
(2000). This approach is more general than that of Adams and Wilson (1996) and
Adams et al. (1997), in that it includes their models as special cases. We also
generalize the work of Ten Have and Uttal (1994) on the continuation-ratio logit model
by considering a general regression model for ordinal responses. The proposed model
for the adjacent-category logit link, to our knowledge, has not been considered
previously. Our random cumulative logit model is similar to that of Hedeker and Gibbons
(1994) and Tutz and Hennevogl (1996), but we employ a different estimation routine.
In particular, for all models we utilize adaptive multivariate Gauss-Hermite quadrature
to numerically approximate the intractable multivariate normal integrals, and
then  proceed  by  directly  maximizing  the  marginal  log-likelihood.  This  approach  has 
not  been  utilized  before  for  multinomial  response  models.  We  also  apply  the  Monte 
Carlo  EM  algorithm  of  Booth  and  Hobert  (1999)  as  an  alternative  estimation  routine 
for  high  dimensional  random  effects  models. 

In  addition  to  the  proposed  maximum  likelihood  methods,  we  also  generalize 
the  work  of  Wolfinger  and  O’Connell  (1993)  to  allow  for  approximate  inference  in 
mixed  nominal  and  ordinal  regression  models.  Keen  and  Engel  (1997)  proposed  an 
iteratively  re- weighted  REML  estimation  routine  for  ordinal  response  data  which  used 
minimum  norm  quadratic  estimation  (MINQUE)  (Rao  1973)  to  obtain  estimates 
of  the  variance  components.  An  advantage  of  the  MINQUE  estimation  method  is 
that  it  is  noniterative,  providing  method  of  moment  type  estimators  for  the  variance 
components.  A disadvantage,  however,  is  that  MINQUE  estimates  can  be  negative. 
Swallow  and  Monahan  (1984)  recommended  the  use  of  ML  or  REML  estimates  over 
MINQUE  based  on  results  of  a series  of  simulation  studies.  Our  extension  of  the 
methods  of  Wolfinger  and  O’Connell  (1993)  is  based  on  a pseudo-likelihood  approach 
that  utilizes  REML  estimation  for  the  variance  components.  We  again  motivate  the 
model  in  terms  of  a multivariate  generalized  linear  mixed  model,  which  allows  for 
simple  application  to  the  links  discussed  above. 

The  remainder  of  the  chapter  is  structured  as  follows:  We  begin  in  Section  3.2  by 
defining  the  multivariate  generalized  linear  mixed  model.  Within  that  section  we  also 
consider  the  motivation  of  the  nominal  and  ordinal  mixed  models  as  extensions  of 
linear  mixed  models.  In  Section  3.3  we  discuss  the  estimation  methods  for  obtaining 
maximum  likelihood  estimates  of  the  regression  parameters  and  variance  components. 
We  provide  details  for  obtaining  estimates  of  the  standard  errors  upon  convergence 
of the algorithms and for carrying out inferences in Section 3.4. We then present in
Section 3.5 an approximate maximum likelihood method for fitting models for nominal
and  ordinal  response  data.  In  Section  3.6  we  apply  the  models  of  this  chapter  to  a 
number  of  datasets.  We  conclude  in  Section  3.7  by  considering  the  extended  random 
threshold  model  of  Tutz  and  Hennevogl  (1996). 

3.2  Multivariate  Generalized  Linear  Mixed  Models

Multivariate generalized linear mixed models (MGLMMs) are extended multivariate
generalized linear models that incorporate random effects linearly along with the
fixed  effects  in  the  linear  predictor.  In  this  section  we  define  MGLMMs  and  show 
that  multinomial  random  effects  models  are  special  cases  of  MGLMMs.  We  then  show 
that  the  multinomial  random  effects  models  under  consideration  can  be  motivated 
from  an  underlying  latent  variable  that  follows  a linear  mixed  model. 

3.2.1  Definition 

As in Section 2.1, let $y_{ij}' = (y_{ij1}, \ldots, y_{ijq})$ be a $q$-dimensional response vector with
corresponding $p$-dimensional covariate vector $x_{ij}' = (x_{ij1}, \ldots, x_{ijp})$, $j = 1, \ldots, T_i$,
$i = 1, \ldots, n$, and denote the fixed effects parameter vector by $\beta$. Also, for the $i$th subject
let $u_i' = (u_{i1}, \ldots, u_{im})$ be an $m$-dimensional vector of subject-specific random effects.
The multivariate generalized linear mixed model is defined by the following two-stage
model.

1.  Stage 1:

Assume that given the random effects $u_i$, the distribution, $f(y_{ij} \mid u_i; \beta)$, of $y_{ij}$
is a member of the multivariate exponential family with conditional mean

\[ \mu_{ij} = E(y_{ij} \mid u_i) = h(\eta_{ij}) \quad\text{and}\quad \eta_{ij} = Z_{ij}\beta + W_{ij}u_i, \tag{3.1} \]

where the response function $h$ and the design matrix $Z_{ij}$ are defined as in Chapter 2,
and $W_{ij}$ denotes the design matrix for the covariates whose effects are assumed to
vary across subjects.

2.  Stage 2:

Assume that the subject-specific random effects $u_i$ are independent and follow
a distribution $G$ with mean 0 and positive definite variance-covariance matrix
$\Sigma$.

In addition, it is assumed that observations within a subject are conditionally (on $u_i$)
independent, and observations between subjects are conditionally and unconditionally
independent. Thus the conditional density of the complete response vector $y$ given the
random effects vector $u$ can be written

\[ f(y \mid \beta; u) = \prod_{i=1}^{n} \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i). \]

Note that Stage 1 requires that conditional on $u_i$, $y_{ij}$ follows a multivariate generalized
linear model as defined in Section 2.1.

The  usual  assumption  that  is  made  concerning  G,  and  the  one  that  we  make  in  this 
chapter,  is  that  G is  the  multivariate  normal  distribution  with  mean  0 and  covariance 
matrix $\Sigma$. In general, however, any distribution can be chosen for G. For example, in
the  next  chapter  we  consider  G to  be  a discrete  distribution  with  unknown  support 
size,  masses,  and  mass  points.  The  assumption  of  normality  is  popular  as  it  allows 
for  a variety  of  covariance  structures  for  the  random  effects.  It  does,  however,  create 
difficulties  for  estimation  of  the  fixed  and  random  effects  since  this  entails  maximizing 
the  marginal  likelihood  of  the  data.  Obtaining  the  marginal  likelihood  is  hampered 
by  the  intractable  integrals  of  the  normal  distribution.  In  Section  3.3  we  present 
an  estimation  routine  that  uses  adaptive  Gauss-Hermite  quadrature  to  obtain  the 
marginal  likelihood. 
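The role of quadrature in forming the marginal likelihood can be sketched with a fixed (non-adaptive) three-point Gauss-Hermite rule, which is a simplification of the adaptive rule discussed in this chapter. The example assumes a baseline-category logit model in which a single normal random intercept $u_i$ with standard deviation `sigma` is added to every logit of a subject; this random effects structure, the function names, and the data are all assumptions made for illustration.

```python
import math

# Three-point Gauss-Hermite rule for the weight function exp(-x^2).
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (math.sqrt(math.pi) / 6.0,
              2.0 * math.sqrt(math.pi) / 3.0,
              math.sqrt(math.pi) / 6.0)

def gh_expectation(f, sigma):
    """Approximate E[f(u)] for u ~ N(0, sigma^2) via the substitution u = sqrt(2) sigma x."""
    total = sum(w * f(math.sqrt(2.0) * sigma * x) for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)

def category_prob(eta, y):
    """Baseline-category probability of category y in 1..R, given the q logits eta."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    numer = math.exp(eta[y - 1]) if y <= len(eta) else 1.0   # baseline category R
    return numer / denom

def subject_marginal_likelihood(etas, ys, sigma):
    """Integrate the product of conditional probabilities over the random intercept."""
    def conditional(u):
        like = 1.0
        for eta, y in zip(etas, ys):
            like *= category_prob([e + u for e in eta], y)
        return like
    return gh_expectation(conditional, sigma)

# Hypothetical subject with two observations, q = 2 logits, observed categories 1 and 3.
etas = [[0.2, -0.5], [1.0, 0.3]]
ys = [1, 3]
marginal = subject_marginal_likelihood(etas, ys, sigma=1.0)
```

Three nodes are far too few for practical work; the point is only that the integral over the random effect is replaced by a weighted sum of conditional likelihoods evaluated at fixed nodes.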

As was shown in Section 2.1, the multinomial logit models that are being considered
fulfill the definition of a multivariate generalized linear model. Thus multinomial
logit models of the form $\eta_{ij} = Z_{ij}\beta + W_{ij}u_i$, where the random effects $u_i$ satisfy Stage
2 of the definition above, can be considered as MGLMMs. Such a result is advantageous
for a number of reasons. As MGLMMs are extensions of multivariate generalized
linear models, the score and information functions have known forms which can
be  utilized  in  the  maximum  likelihood  estimation  routines.  The  MGLMM  framework 
also  provides  a unified  approach  to  modeling  for  the  class  of  multinomial  models. 
Maximum  likelihood  algorithms  can  be  defined  for  the  general  MGLMM  and  then 
modified  appropriately  for  the  link  function,  response  function,  and  random  effects 
structure  under  consideration.  For  these  reasons  we  present  the  fitting  algorithms  in 
Sections  3.3,  3.5,  and  4.2  in  terms  of  a MGLMM  for  multinomial  response  data. 

3.2.2  Motivation  of  the  Multinomial  Logit  Models 

We  have  seen  in  the  previous  section  that  multinomial  logit  random  effects  models 
are  special  cases  of  MGLMMs.  One  can  also  motivate  these  models  by  assuming 
that  the  true  response  is  an  underlying  latent  variable  that  follows  a linear  mixed 
model.  This  approach,  using  a linear  fixed  effects  model,  has  been  used  to  motivate 
the  fixed  effects  versions  of  the  baseline-category  logit  model,  the  continuation-ratio 
logit  model,  and  the  cumulative  logit  model.  Thus,  in  the  sections  below,  we  begin  by 
motivating  the  fixed  effects  models  and  then  extend  the  motivations  to  include  random 
effects.  The  adjacent-category  logit  model,  however,  lacks  a meaningful  motivation 
based  on  an  underlying  latent  variable.  We  present  a latent  variable  motivation  that 
leads  to  the  adjacent-category  logit  model,  but,  admittedly,  lacks  the  interpretability 
of  the  other  motivations. 

Baseline-Category  logit  model 

The  baseline-category  logit  model  can  be  motivated  from  the  consideration  of  an 
underlying  latent  variable  by  using  the  principle  of  maximum  random  utility.  The 
concept  of  random  utility  arose  out  of  psychological  research  by  Thurstone  (1927) 
and  has  been  applied  in  many  areas  such  as  consumer  theory,  transportation  theory, 
and behavioral theory (see, e.g., Ben-Akiva and Lerman 1985). The basic theory is
that a consumer who is faced with a finite set of choices will select the option that
provides  him  or  her  the  maximum  use  or  utility.  Since  it  is  impossible  to  predict  the 
chosen  alternative  for  all  consumers  due  to,  for  example,  possible  misperceptions  of 
choices  by  individuals,  Thurstone  (1927)  considered  the  true  utilities  of  the  possible 
choices  as  random  variables.  Thus  the  probability  that  a particular  alternative  is 
chosen  is  defined  as  the  probability  that  it  has  the  greatest  utility  among  all  possible 
choices. 

We first consider the motivation of the fixed effects baseline-category logit model,
in which, for convenience, we suppress the subscripts $i$ and $j$. For a nominal response
variable $Y$, it is assumed that a latent variable $Y_r^*$ is associated with the $r$th category
or choice, $r = 1, \ldots, R = q + 1$. $Y_r^*$ can be thought of as a measure of the utility of the
$r$th category. Proceeding as in Fahrmeir and Tutz (1994, p. 70), let $Y_r^*$ be denoted
as

\[ Y_r^* = U_r + \epsilon_r, \tag{3.2} \]

where $U_r = \alpha_r + x'B_r = z'\beta_r^*$ is the unobserved utility for the $r$th alternative and
$\epsilon_1, \ldots, \epsilon_R$ are random variables with some continuous distribution function $F$. It is
assumed that the unobserved utility $U_r$ depends on a vector of covariates $z' = (1, x')$
with corresponding parameter vector $\beta_r^{*\prime} = (\alpha_r, B_r')$. The principle of maximum
random utility assumes that choice $r$ will be chosen if it provides the maximum
perceived utility for the consumer. That is, the observed response $Y$ takes on the
value $r$ according to

\[ Y = r \iff Y_r^* = \max_{l = 1, \ldots, R} Y_l^*. \tag{3.3} \]

Thus the response category $r$ is chosen if the underlying latent variable $Y_r^*$ has the
maximum utility.

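Let $f(\epsilon)$ be the density function of $\epsilon$ and assume that the $\{\epsilon_r\}$ are independent.
Given the relationship between $Y$ and $Y_r^*$ defined in (3.3) and the model definition
(3.2) for $Y_r^*$, the probability that a consumer chooses alternative $r$ is

\begin{align*}
P(Y = r) &= P(Y_r^* - Y_1^* \ge 0, \ldots, Y_r^* - Y_R^* \ge 0) \\
&= P(\epsilon_1 \le U_r - U_1 + \epsilon_r, \ldots, \epsilon_R \le U_r - U_R + \epsilon_r) \\
&= \int_{-\infty}^{\infty} \prod_{s \ne r} F(U_r - U_s + \epsilon) \, f(\epsilon) \, d\epsilon. \tag{3.4}
\end{align*}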


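The baseline-category logit model is motivated by assuming the $\{\epsilon_r\}$ follow the extreme
value distribution, whose distribution function is defined by $F(x) = \exp(-\exp(-x))$.
Under these assumptions, the integrand of (3.4) takes the form

\[ \prod_{s \ne r} F(U_r - U_s + \epsilon) \, f(\epsilon) = \prod_{s \ne r} \exp\left(-e^{-U_r + U_s - \epsilon}\right) \exp\left(-\epsilon - e^{-\epsilon}\right). \tag{3.5} \]

Letting $\lambda = \log\left(\sum_{s=1}^{R} e^{U_s - U_r}\right)$ in (3.5) and defining $\epsilon^* = \epsilon - \lambda$, the integration in (3.4)
yields

\begin{align*}
P(Y = r) &= \int_{-\infty}^{\infty} \exp\left(-\epsilon - e^{-(\epsilon - \lambda)}\right) d\epsilon \\
&= \exp(-\lambda) \int_{-\infty}^{\infty} \exp\left(-\epsilon^* - e^{-\epsilon^*}\right) d\epsilon^* \\
&= \exp(-\lambda).
\end{align*}

Thus the probability that the $r$th alternative is chosen is

\[ P(Y = r) = \frac{\exp(U_r)}{\sum_{s=1}^{R} \exp(U_s)} = \frac{\exp(U_r - U_R)}{1 + \sum_{s=1}^{q} \exp(U_s - U_R)}. \tag{3.6} \]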

By substituting $\tilde{U}_r = U_r - U_R$ into (3.6), one obtains the baseline-category logit model
(2.14), where $\tilde{U}_r = (\alpha_r - \alpha_R) + x'(B_r - B_R) = z'\beta_r$.
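
As an informal numerical check of (3.6), and not as part of the original derivation, the following Python sketch simulates the maximum random utility mechanism (3.2)-(3.3) with extreme value errors and compares the simulated choice frequencies with the closed-form probabilities. The utility values, the function names, and the number of draws are illustrative assumptions.

```python
import math
import random


def softmax_probs(utilities):
    """Closed-form choice probabilities exp(U_r) / sum_s exp(U_s), as in (3.6)."""
    m = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - m) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def gumbel(rng):
    """Draw from F(x) = exp(-exp(-x)) as -log(E), with E ~ Exponential(1)."""
    e = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
    return -math.log(e)


def simulate_choice_probs(utilities, n_draws, seed=0):
    """Relative frequency with which each alternative has the largest latent utility."""
    rng = random.Random(seed)
    counts = [0] * len(utilities)
    for _ in range(n_draws):
        latent = [u + gumbel(rng) for u in utilities]  # Y_r* = U_r + e_r
        counts[latent.index(max(latent))] += 1         # Y = argmax_r Y_r*
    return [c / n_draws for c in counts]


utilities = [0.5, -0.2, 1.0]  # hypothetical U_1, U_2, U_3
exact = softmax_probs(utilities)
approx = simulate_choice_probs(utilities, n_draws=200_000)
```

With a large number of draws the two vectors should agree up to Monte Carlo error, which illustrates why the extreme value assumption leads to the baseline-category logit form.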

Now consider a nominal response variable for the $j$th observation on the $i$th
subject with corresponding latent response $Y_{ijr}^*$ for the $r$th alternative. The general
baseline-category logit random effects model is motivated by assuming that the $r$th
unobserved latent response follows the mixed linear model

\[ Y_{ijr}^* = U_{ijr} + \epsilon_{ijr} = z_{ij}'\beta_r^* + w_{ij}'u_{ir}^* + \epsilon_{ijr}, \tag{3.7} \]

where $U_{ijr}$ now includes cluster- and category-specific random effects $u_{ir}^*$. As before,
the $\{\epsilon_{ijr}\}$ are independently and identically distributed random variables that follow
the extreme value distribution, $F$. In addition, the distribution of the cluster-specific
random effects $u_i^{*\prime} = (u_{i1}^{*\prime}, \ldots, u_{iR}^{*\prime})$ is assumed to be multivariate normal with mean $0$
and covariance $\Sigma^*$, where the $\{u_i^*\}$ are distributed independently of the $\{\epsilon_{ijr}\}$. Note
that imposing a multivariate distribution on the vector of cluster-specific random
effects $u_i^*$ allows the category-specific random effects to be correlated.

As in Stage 1 of the definition of the multivariate generalized linear mixed model
given in Section 3.2.1, motivation proceeds by conditioning on the random effects $u_i^*$.
That is, given the unobserved random effects $u_i^*$, the nominal response $Y_{ij}$ takes the
value $r$ according to

\[ Y_{ij} = r \mid u_i^* \iff Y_{ijr}^* = \max_{l = 1, \ldots, R} Y_{ijl}^*. \tag{3.8} \]

Using the same approach as for the fixed effects model, the probability that $Y_{ij} = r$
given the random effects $u_i^*$ is

\[ P(Y_{ij} = r \mid u_i^*) = \frac{\exp(U_{ijr} - U_{ijR})}{1 + \sum_{s=1}^{q} \exp(U_{ijs} - U_{ijR})}. \]

By letting $\beta_r = \beta_r^* - \beta_R^*$ and $u_{ir} = u_{ir}^* - u_{iR}^*$, the model takes on the usual form

\[ P(Y_{ij} = r \mid u_i) = \frac{\exp(\eta_{ijr})}{1 + \sum_{s=1}^{q} \exp(\eta_{ijs})}, \tag{3.9} \]


where $\eta_{ijr} = z_{ij}'\beta_r + w_{ij}'u_{ir}$ and $u_i' = (u_{i1}', \ldots, u_{iq}')$, with $u_i$ distributed as a multivariate
normal with mean $0$ and covariance matrix $\Sigma$. In matrix notation, the random
effects baseline-category logit model takes the form $\eta_{ij} = Z_{ij}\beta + W_{ij}u_i$, with $Z_{ij}$ and
$\beta$ defined as in Section 2.3.1 and $W_{ij}$ defined according to the random effects structure. For
example, to allow each threshold to be random, $W_{ij}$ would have the form

\[ W_{ij} = \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix} = I_q, \]

the $q \times q$ identity matrix, and $u_i' = (u_{i1}, \ldots, u_{iq})$. Since the thresholds are not ordered in the baseline-category
logit model, there are no complications encountered in the estimation procedure when
they are allowed to vary individually. Though the motivation in this section included
only covariates related to the subject, $z_{ij}$, inclusion of covariates constant across
subjects and specific to the alternatives is straightforward, as shown in Section 2.4.1.

Continuation-Ratio logit model

As noted in Section 2.3.4, the continuation-ratio logit model is useful for modeling
data in which the ordering of the responses is due to a sequential mechanism. Recall
that in the example of that section, tonsils could not be categorized as “Greatly
enlarged” unless they had passed through the “Present but not enlarged” and
“Enlarged” categories. One method of motivating this model is to consider an underlying
latent response which follows a similar sequential process (Fahrmeir and Tutz 1994,
p. 85). Again considering the fixed effects model first and suppressing subscripts, let
$U_1, \ldots, U_q$ denote $q$ latent variables defined by the linear model $U_r = -x'\gamma + \epsilon_r$, where
$\epsilon_r$ has distribution function $F$. Considering the response mechanism as sequential,
the ordinal response $Y$ starts with the value one according to

\[ Y = 1 \iff U_1 \le \alpha_1, \]

where $\alpha_1$ denotes a threshold parameter. If $U_1$ exceeds $\alpha_1$, the response $Y$ proceeds
to category two where

\[ Y = 2 \text{ given } Y \ge 2 \iff U_2 \le \alpha_2. \]

For the $r$th category, the relation is given by

\[ Y = r \text{ given } Y \ge r \iff U_r \le \alpha_r. \]

The sequential process continues until $U_r$ does not exceed the threshold $\alpha_r$, at which
point the response $r$ is observed.

By replacing $U_r$ with its linear model representation and by considering the equivalent
representation of the sequential process

\[ Y > r \text{ given } Y \ge r \iff U_r > \alpha_r, \tag{3.10} \]

the probability that $Y = r$ given that $Y \ge r$ is given by

\[ P(Y = r \mid Y \ge r) = F(\alpha_r + x'\gamma), \quad r = 1, \ldots, R, \tag{3.11} \]

where $\alpha_R = \infty$. The continuation-ratio logit model is obtained by taking $F$ to
be the logistic distribution, $F(x) = 1/(1 + \exp(-x))$. Denoting $\eta_r = \alpha_r + x'\gamma$, the
unconditional probability that the ordinal response $Y$ takes the value $r$, $r = 1, \ldots, R$,
is given by

\begin{align*}
P(Y = r) &= P(Y = r \mid Y \ge r) \, P(Y \ge r) \\
&= F(\eta_r) \prod_{l=1}^{r-1} \left[1 - F(\eta_l)\right] \\
&= \frac{1}{1 + \exp(-\eta_r)} \prod_{l=1}^{r-1} \left[1 - \frac{1}{1 + \exp(-\eta_l)}\right], \tag{3.12}
\end{align*}

where $\prod_{l=1}^{0} \{\cdot\} = 1$.
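
To make (3.11) and (3.12) concrete, the following Python sketch accumulates the sequential "stop or continue" probabilities into unconditional category probabilities. It is only an illustration: the function names are hypothetical, the covariate effect is passed in as the scalar $x'\gamma$, and the last category is handled through the convention $\alpha_R = \infty$.

```python
import math


def logistic(x):
    """Logistic distribution function F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def continuation_ratio_probs(alphas, x_gamma):
    """Category probabilities P(Y = r), r = 1, ..., R = q + 1, following (3.12).

    alphas  : thresholds alpha_1, ..., alpha_q (need not be ordered)
    x_gamma : the scalar x'gamma
    """
    probs = []
    reach = 1.0  # P(Y >= r), the probability of reaching category r
    for alpha in alphas:
        stop = logistic(alpha + x_gamma)  # P(Y = r | Y >= r), as in (3.11)
        probs.append(stop * reach)
        reach *= 1.0 - stop               # continue on to category r + 1
    probs.append(reach)                   # alpha_R = infinity: stop with certainty
    return probs


example = continuation_ratio_probs([0.0, 0.0], 0.0)  # gives [0.5, 0.25, 0.25]
```

Because each step only splits the probability of reaching a category, the returned values sum to one for any choice of thresholds.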

Extending the continuation-ratio logit model to include random effects is straightforward
in the motivation given above. Reintroducing subscripts, let the $R$ latent
variables for the $j$th observation on the $i$th subject be denoted by $U_{ij1}, \ldots, U_{ijR}$. To
incorporate random effects, one now assumes that $U_{ijr} = -x_{ij}'\gamma - w_{ij}'u_i + \epsilon_{ijr}$, where
$u_i \sim MVN(0, \Sigma)$ is a vector of cluster-specific random effects, with the $\{u_i\}$
distributed independently of the $\{\epsilon_{ijr}\}$. The sequential mechanism defined in (3.10) is
now assumed to hold conditionally on the random effects $u_i$. That is,

\[ \{Y_{ij} > r \mid u_i\} \text{ given } \{Y_{ij} \ge r \mid u_i\} \iff U_{ijr} > \alpha_r. \]

Then the conditional probability that $Y_{ij} = r$ given $u_i$ takes the same form as (3.12)
with linear predictor $\eta_{ijr} = \alpha_r + x_{ij}'\gamma + w_{ij}'u_i = z_{ijr}'\beta + w_{ij}'u_i$. In terms of the
design matrices, the continuation-ratio logit model can be written in the familiar form
$\eta_{ij} = Z_{ij}\beta + W_{ij}u_i$.

Cumulative  logit  model 

As in the continuation-ratio logit model, the cumulative logit model can be motivated
from a threshold approach, which has been shown by numerous authors (see,
e.g., Tutz and Hennevogl 1996). For this approach, one assumes that the observed
categorical response $Y$ is a categorized version of an underlying latent continuous
response $Y^*$. Observations of $Y^*$ are obtained only through the less precise categorized
response $Y$, which is defined by a set of thresholds $(\alpha_1, \ldots, \alpha_q)$. That is,

\[ Y = r \iff \alpha_{r-1} < Y^* \le \alpha_r, \quad r = 1, \ldots, R, \tag{3.13} \]

where $\alpha_0 = -\infty < \alpha_1 < \cdots < \alpha_R = \infty$. Since the $\{\alpha_r\}$ divide the continuous
response $Y^*$ into categories, they are often referred to as cut-points. To motivate
the cumulative logit model, it is assumed that the latent response $Y^*$ follows a linear
model $Y^* = -x'\gamma + \epsilon$, where $x$ is an observed covariate vector with corresponding
parameter vector $\gamma$, and $\epsilon$ follows the distribution function $F$.

For the general cumulative model, the probability of observing a response $Y \le r$
is given from (3.13) by

\[ P(Y \le r) = F(\alpha_r + x'\gamma), \quad r = 1, \ldots, q. \]

A variety of cumulative link models can be defined by specifying $F$, such as the
cumulative probit model when $F$ is the standard normal distribution. To obtain the
cumulative logistic model, $F$ is taken to be the logistic distribution function. Hence,
the probability that $Y$ takes on the value $r$ is

\[ P(Y = r) = \frac{1}{1 + \exp(-\eta_r)} - \frac{1}{1 + \exp(-\eta_{r-1})}, \quad r = 1, \ldots, q, \tag{3.14} \]

where $\eta_r = \alpha_r + x'\gamma$ and $\alpha_0 = -\infty$. We note that as a special case of the cumulative
logit motivation, binary logistic models can also be motivated using the threshold
approach, where the underlying response $Y^*$ is categorized by a single threshold $\alpha$.
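
The differencing in (3.14) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than code from the original work: the function names are hypothetical, the covariate effect enters as the scalar $x'\gamma$, and the final category $R = q + 1$ is obtained as $1 - F(\eta_q)$, which follows from $\alpha_R = \infty$.

```python
import math


def logistic(x):
    """Logistic distribution function F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def cumulative_logit_probs(alphas, x_gamma):
    """Category probabilities P(Y = r), r = 1, ..., R = q + 1, by differencing (3.14).

    alphas  : ordered cut-points alpha_1 < ... < alpha_q
    x_gamma : the scalar x'gamma
    """
    if any(a >= b for a, b in zip(alphas, alphas[1:])):
        raise ValueError("cut-points must be strictly increasing")
    # Cumulative probabilities P(Y <= r) = F(alpha_r + x'gamma), plus P(Y <= R) = 1.
    cumulative = [logistic(a + x_gamma) for a in alphas] + [1.0]
    probs, previous = [], 0.0  # alpha_0 = -infinity gives P(Y <= 0) = 0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


binary_case = cumulative_logit_probs([0.0], 0.0)  # single threshold: [0.5, 0.5]
```

The ordering check reflects the requirement $\alpha_1 < \cdots < \alpha_q$; without it, the differences in (3.14) could be negative.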

To allow for random effects, the underlying latent variable $Y_{ij}^*$ is assumed to follow
a linear model with both fixed and random components. That is, $Y_{ij}^* = -x_{ij}'\gamma -
w_{ij}'u_i + \epsilon_{ij}$, with $u_i \sim MVN(0, \Sigma)$, $i = 1, \ldots, n$, distributed independently of the
$\{\epsilon_{ij}\}$. The relationship between the observed response $Y_{ij}$ and the latent response $Y_{ij}^*$
is now defined conditionally, such that

\[ Y_{ij} = r \mid u_i \iff \alpha_{r-1} < Y_{ij}^* \le \alpha_r, \quad r = 1, \ldots, R. \tag{3.15} \]

The response probability (3.14) is now defined conditionally on $u_i$, which follows
directly from (3.15). The resulting linear predictor is $\eta_{ijr} = z_{ijr}'\beta + w_{ij}'u_i$. In Section
3.7 we will consider the motivation of the extended threshold model of Tutz and
Hennevogl (1996), which allows each threshold to vary individually.

Adjacent-Category logit model

The motivations of the previous three models have been based on realistic mechanisms
for relating the latent response to the observed response. For example, the
baseline-category logit model was based on the psychological principle of maximum
random utility. For the adjacent-category logit model, however, a meaningful
motivation has not been established. Recall that in this model a common association
parameter $\beta$ is assumed to hold for all logits constructed from adjacent response
categories. As the model is defined for ordinal responses, one might consider a motivation
based on a threshold approach. However, the threshold approach is not appropriate
since the thresholds in the adjacent-category logit model need not be ordered. The
sequential process of the ordinal continuation-ratio logit model does not apply
either. We outline below a motivation for the adjacent-category logit model that
parallels that of the baseline-category logit model. The motivation does lead to the
adjacent-category logit model; however, it lacks the realistic interpretation that the
other motivations possess.

Though the model is defined for ordinal responses, the adjacent-category logit
model has response probabilities that are very similar to those of the baseline-category
logit model (compare (2.17) with (2.14)). Thus one approach for motivating the
adjacent-category logit model is that based on a modified principle of maximum
random utility. Recall that the principle of maximum random utility assumes that a
subject, given a set of $R$ nominal choices, will choose the alternative that maximizes
his or her utility. For the adjacent-category motivation, one could assume that a
subject chooses from a set of $R$ ordered alternatives, where the utility $Y_r^*$ for the $r$th
ordered alternative is defined by the model

\[ Y_r^* = U_r + \epsilon_r = r\alpha_r + r x'B + \epsilon_r, \quad r = 1, \ldots, R. \tag{3.16} \]

Note that (3.16) differs from (3.2) in two respects. First, (3.16) has a modified linear
predictor in that it is scaled by the ordered category choice. Second, (3.16) has a
common parameter $B$ across all utilities. The relationship between the observed ordinal
response $Y$ and the latent responses $Y_r^*$, $r = 1, \ldots, R$, is given by

\[ Y = r \iff Y_r^* = \max_{l = 1, \ldots, R} Y_l^*. \tag{3.17} \]

By assuming that $\epsilon_r$ in (3.16) follows the extreme value distribution and then
calculating the probability that $Y = r$ according to (3.17), the probability of the $r$th
ordered alternative can be shown to be

\[ P(Y = r) = \frac{\exp(\eta_r)}{1 + \sum_{l=1}^{q} \exp(\eta_l)}, \quad r = 1, \ldots, R, \tag{3.18} \]

where $\eta_r = (r - R) \, z'\beta$ with $z' = (1, x')$ and $\beta' = (\alpha_1, \ldots, \alpha_q, B')$. The inclusion of
random effects is straightforward, following the same conditioning arguments found
in the baseline-category logit motivation.

One criticism of the motivation given above is that the principle of maximum
random utility was conceived for, and intuitively makes sense for, situations in which
the possible alternatives are nominal. However, one could argue that the use of the
principle of maximum random utility in the ordinal setting could be appropriate as
well. Consider a situation in which a group of subjects is asked to rate their job
satisfaction on an ordinal scale from “Very dissatisfied” to “Very satisfied”. The
response given by a subject is the one that most epitomizes or “maximizes” their
feelings. A second criticism is that the form of the utility model given in (3.16) is
contrived. It is obviously chosen so that the desired form of the adjacent-category
logit model is obtained. However, if one wished to motivate a baseline-category logit
model that allowed, say, alternative-specific covariates which varied across choosers,
one would simply modify (3.2) appropriately so that the desired model was obtained
as well. Thus we contend that the given motivation can be appropriate for the
adjacent-category logit model.


3.3  Maximum  Likelihood  Estimation 


From the definition given in Section 3.2.1 and the independence of observations
between clusters, the form of the marginal likelihood for the multivariate generalized
linear mixed model with response vector $y_{ij}$ is immediately given by

\[ L(\beta, \Sigma) = \prod_{i=1}^{n} \int \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] dG(u_i). \]

For the multinomial random effects models considered in this chapter, $f(\cdot \mid \beta; u_i)$ is
the multinomial distribution and $G(u_i)$ is the multivariate normal distribution with
mean $0$ and covariance $\Sigma$. Thus the marginal likelihood for the general multinomial
random effects model is

\[ L(\beta, \Sigma) = \prod_{i=1}^{n} \int \cdots \int \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] (2\pi)^{-m/2} \, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2} u_i'\Sigma^{-1}u_i\right) du_i, \tag{3.19} \]

where $u_i = (u_{i1}, \ldots, u_{im})'$ and $m$ is the dimension of the random effects vector $u_i$. Note
that the multinomial distribution $f(y_{ij} \mid \beta; u_i)$ in (3.19) is defined in terms of the
scaled multinomial response $\bar{y}_{ij} = y_{ij}/n_{ij}$, where $n_{ij}$ is the multinomial index
associated with the response $y_{ij}$. In many longitudinal datasets with categorical responses,
the multinomial index for each observation will be one. We will, however, define the
general multinomial random effects model in terms of $\bar{y}_{ij}$, with the understanding
that $n_{ij}$ may be one for all subjects and observations.

In this section we present methods for finding maximum likelihood estimates of $\beta$
and $\Sigma$ for the general multinomial random effects model. Finding maximum likelihood
estimates entails maximizing the marginal likelihood (3.19). To do this, however, one
must first evaluate an intractable $m$-dimensional integral for each cluster. Thus methods
for finding maximum likelihood estimates must incorporate both a maximization
routine and an algorithm for approximating integrals. Technically, such methods
provide only approximate maximum likelihood estimates since the intractable integrals
are only approximated. However, we also propose in Section 3.5 a routine which fits
linear mixed models to a Taylor series approximation of the link function to obtain
approximate maximum likelihood estimates. Thus we refer to the methods in this
section as maximum likelihood, and reserve the term approximate for the method
in Section 3.5. In this section we discuss two such maximum likelihood methods.
The first directly maximizes (3.19) using adaptive Gauss-Hermite quadrature to
approximate the intractable integrals. The second uses the EM algorithm to indirectly
maximize (3.19) using Monte Carlo techniques to evaluate the integrals.

3.3.1  Maximum  Likelihood  Algorithms 

There has been a considerable amount of recent research focused on accurate
and efficient methods for obtaining maximum likelihood estimates in generalized
linear mixed models (Zeger and Karim 1991; McCulloch 1997; Booth and Hobert
1999). Generally, these methods can be categorized by whether they directly or
indirectly maximize the marginal (log) likelihood, and by whether they use
deterministic or random sampling for the numerical integration. For direct maximization,
the marginal (log) likelihood is obtained through numerical integration and then
directly maximized by, for example, using Fisher scoring. Indirect maximization is
accomplished through the EM algorithm, a powerful iterative method for obtaining
maximum likelihood estimators in incomplete data situations. As in the direct
maximization method, the E-step in the EM algorithm for generalized linear mixed models
contains intractable integrals and so numerical integration is required. Many authors
have suggested using Gauss-Hermite quadrature or Monte Carlo integration
techniques to evaluate the intractable integrals in the direct and indirect maximization
methods (Fahrmeir and Tutz 1994, Chap. 7; McCulloch 1997; Booth and Hobert
1999).

Gauss-Hermite quadrature is an example of a deterministic sampling approach
to numerical integration. In general, quadrature rules are based on re-expressing a
regular function $f(x)$ as the product of a known weight function $w(x)$ and another
function $h(x)$ such that $f(x) = w(x)h(x)$. Then, for predetermined nodes $q_l$ and
weights $w_l$, an integral can be approximated by a discrete summation,

\[ \int f(x) \, dx = \int w(x) h(x) \, dx \approx \sum_{l=1}^{K} w_l \, h(q_l), \tag{3.20} \]

where the integration is over the domain of $f(x)$. Gauss-Hermite integration (Stroud
and Secrest 1966) is popular in statistics due to the form of the weight function $w(x)$
and the bounds of the integration. Specifically, univariate Gauss-Hermite quadrature
approximates integrals over the real line that can be expressed in terms of the
weight function $w(x) = \exp(-x^2)$. Thus for univariate Gauss-Hermite quadrature,
approximation (3.20) takes the form

\[ \int_{-\infty}^{+\infty} f(x) \, dx = \int_{-\infty}^{+\infty} \exp(-x^2) \, h(x) \, dx \approx \sum_{l=1}^{K} w_l \, h(q_l), \tag{3.21} \]

where the weights $w_l$ and nodes $q_l$ are obtained from Stroud and Secrest (1966) or,
more conveniently, by using an algorithm proposed by Golub (1973). If $h(x)$ is a
polynomial of degree less than or equal to $2K - 1$, the approximation given by (3.21)
is exact. Thus (3.21) can be made arbitrarily accurate by increasing the number of
nodes $K$. The Gauss-Hermite quadrature rule (3.21) is easily applied to statistical
integration problems in which the normal density $g_N(x; \mu, \sigma^2)$ can be expressed as
a weight function, $w(x) = g_N(x; \mu, \sigma^2)$. Such integrals can be approximated using
(3.21) by substituting $w_l$ and $q_l$ with $\pi^{-1/2} w_l$ and $\sqrt{2}\,\sigma q_l + \mu$, respectively.
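
The substitution just described can be illustrated with the three-point rule ($K = 3$), whose nodes $0, \pm\sqrt{3/2}$ and weights $2\sqrt{\pi}/3, \sqrt{\pi}/6, \sqrt{\pi}/6$ are standard tabulated values. The Python sketch below is only a small worked example: the function names are hypothetical, and in practice $K$ would be much larger and the nodes would come from a table or an algorithm such as Golub's rather than being hard-coded.

```python
import math

# Three-point Gauss-Hermite rule for the weight function exp(-x^2).
NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
WEIGHTS = [math.sqrt(math.pi) / 6.0,
           2.0 * math.sqrt(math.pi) / 3.0,
           math.sqrt(math.pi) / 6.0]


def gh_integral(h):
    """Approximate the integral of exp(-x^2) h(x) over the real line, as in (3.21)."""
    return sum(w * h(q) for w, q in zip(WEIGHTS, NODES))


def normal_expectation(h, mu, sigma):
    """Approximate the integral of g_N(x; mu, sigma^2) h(x) dx.

    Uses the rescaled weights pi^(-1/2) w_l and shifted nodes sqrt(2) sigma q_l + mu.
    """
    scale = 1.0 / math.sqrt(math.pi)
    return sum(scale * w * h(math.sqrt(2.0) * sigma * q + mu)
               for w, q in zip(WEIGHTS, NODES))


# Both integrands are polynomials of degree <= 2K - 1 = 5, so the rule is exact.
fourth_moment = gh_integral(lambda x: x ** 4)                   # 3 sqrt(pi) / 4
second_moment = normal_expectation(lambda x: x * x, 1.0, 2.0)   # mu^2 + sigma^2 = 5
```

For integrands that are not low-degree polynomials, such as the products of multinomial probabilities in (3.19), the same code applies but the result is only an approximation whose quality depends on $K$.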

The Gauss-Hermite rule (3.21) can be extended to $m$-dimensional integrals if the
weight function is of the form $w(x) = \exp\left(-\sum_{i=1}^{m} x_i^2\right)$. If the weight function is the
multivariate normal distribution $g_{MVN}(x; \mu, \Sigma)$, the $m$-dimensional Gauss-Hermite rule
takes the form

\[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g_{MVN}(x; \mu, \Sigma) \, h(x) \, dx \approx \sum_{i_1=1}^{K} \cdots \sum_{i_m=1}^{K} \pi^{-m/2} \, w_{\mathbf{i}} \, h\left(\sqrt{2}\, \Sigma^{1/2} q_{\mathbf{i}} + \mu\right), \tag{3.22} \]

where $w_{\mathbf{i}} = \prod_{k=1}^{m} w_{i_k}$, $q_{\mathbf{i}} = (q_{i_1}, \ldots, q_{i_m})'$, $\Sigma^{1/2}$ denotes the left
Cholesky (lower triangular) square root of $\Sigma$, and $\{w_{i_k}\}$ and $\{q_{i_k}\}$, $k = 1, \ldots, m$, are
the weights and nodes from the univariate Gauss-Hermite rule. Note that the summation
in (3.22) is over $K^m$ nodes so that the number of nodes increases exponentially
with the number of dimensions $m$. Thus multivariate Gauss-Hermite quadrature is
currently only computationally feasible for integral dimensions of up to five or six. It
should be noted, however, that $K$ should be chosen large enough to ensure accurate
approximation of the integrals.
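
A minimal Python sketch of the product rule (3.22) follows, again using the tabulated three-point univariate rule and a hand-written Cholesky factorization so that no external library is needed. The function names and the example mean and covariance are assumptions made for illustration; the loop over `itertools.product` makes the $K^m$ growth in the number of nodes explicit.

```python
import itertools
import math

# Three-point univariate Gauss-Hermite rule for the weight exp(-x^2).
NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
WEIGHTS = [math.sqrt(math.pi) / 6.0,
           2.0 * math.sqrt(math.pi) / 3.0,
           math.sqrt(math.pi) / 6.0]


def cholesky_lower(sigma):
    """Left (lower triangular) Cholesky factor L with L L' = sigma."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low


def mvn_expectation(h, mu, sigma):
    """Approximate the integral of g_MVN(x; mu, sigma) h(x) dx via (3.22)."""
    m = len(mu)
    low = cholesky_lower(sigma)
    total = 0.0
    # One term per combination (i_1, ..., i_m): K^m terms in all.
    for idx in itertools.product(range(len(NODES)), repeat=m):
        q = [NODES[i] for i in idx]
        weight = math.pi ** (-m / 2.0) * math.prod(WEIGHTS[i] for i in idx)
        node = [mu[r] + math.sqrt(2.0) * sum(low[r][c] * q[c] for c in range(r + 1))
                for r in range(m)]
        total += weight * h(node)
    return total


mu = [1.0, -1.0]                      # hypothetical mean vector
sigma = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical covariance matrix
cross_moment = mvn_expectation(lambda x: x[0] * x[1], mu, sigma)  # 0.5 + (1)(-1) = -0.5
```

The cross moment is reproduced exactly here because the integrand is a low-degree polynomial in each coordinate; for the likelihood integrals of interest the rule is approximate.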

In contrast to Gauss-Hermite quadrature, Monte Carlo methods use randomly
sampled nodes to approximate integrals. In Monte Carlo integration techniques the
intractable integrals are viewed as expectations with respect to some density function.
For example, consider the integral

\[ I = \int_{-\infty}^{+\infty} h(x) \, g(x) \, dx, \tag{3.23} \]

where $h(x)$ is a continuous function and $g(x)$ is a density. A simple Monte Carlo
method for calculating (3.23) consists of approximating $I$ by

\[ \hat{I} = \frac{1}{K} \sum_{l=1}^{K} h(q_l), \]

where $q_l$, $l = 1, \ldots, K$, are independently and identically distributed samples from the
density $g(x)$. By the Law of Large Numbers, $\hat{I}$ converges almost surely to $I$ (Tanner
1996, p. 51). More sophisticated and efficient approximations based on Monte Carlo
methods, such as importance sampling and rejection sampling (see, e.g., Tanner 1996,
p. 54), can also be utilized. Extension to multivariate integrals is straightforward,
where samples are then drawn from a multivariate candidate distribution.
Computationally, Monte Carlo methods are more attractive than Gauss-Hermite methods
as the number of draws does not increase exponentially with the dimension of the
integrals. It is also possible to assess the error in the integral approximations, as
pointed out by Booth and Hobert (1999) and discussed shortly. The assessment of
the error in Gauss-Hermite quadrature is extremely complicated, requiring evaluation
of the $2K$th derivative of the integrand being approximated.
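
The simple Monte Carlo estimator $\hat{I}$ and a basic assessment of its error can be sketched as follows. This is a generic illustration and not the Booth and Hobert (1999) procedure: the function names, the choice of $h$ as the logistic function, the standard normal density for $g$, and the sample size are all assumptions, and the standard error is the usual sample standard deviation divided by $\sqrt{K}$.

```python
import math
import random


def monte_carlo_expectation(h, sampler, n_draws, seed=0):
    """Return (estimate, standard error) for I = integral of h(x) g(x) dx.

    sampler(rng) must return one independent draw from the density g.
    """
    rng = random.Random(seed)
    values = [h(sampler(rng)) for _ in range(n_draws)]
    mean = sum(values) / n_draws
    variance = sum((v - mean) ** 2 for v in values) / (n_draws - 1)
    return mean, math.sqrt(variance / n_draws)


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


# E[logistic(Z)] for Z ~ N(0, 1) equals 1/2 by symmetry, which gives a known target.
estimate, std_error = monte_carlo_expectation(
    logistic, lambda rng: rng.gauss(0.0, 1.0), n_draws=50_000
)
```

Reporting `std_error` alongside `estimate` gives a direct, if rough, indication of whether the number of draws is adequate, which has no simple counterpart for a fixed quadrature rule.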

The approaches used to fit multivariate generalized linear mixed models have
generally been modeled after those used for generalized linear mixed models. For single
random effects models, a Gauss-Hermite EM algorithm was used by Jansen (1990),
while Ezzet and Whitehead (1991) utilized a Gauss-Hermite Newton-Raphson
algorithm. Both Hedeker and Gibbons (1994) and Tutz and Hennevogl (1996) proposed
algorithms for fitting general ordinal regression models that allowed multiple random
effects. Hedeker and Gibbons (1994) and Hedeker (2000) considered a direct
maximization approach using multivariate Gauss-Hermite quadrature to approximate the
integrals in conjunction with Fisher’s method of scoring for maximization. They,
however, used a modified multivariate Gauss-Hermite rule as compared to that given in
(3.22). They orthogonally transformed the response model so that the weight function
was the multivariate standard normal. Specifically, assume that the vector of
random effects $u_i \sim MVN(0, \Sigma)$ and let $\Sigma = \Sigma^{1/2} \Sigma^{1/2\prime}$, where $\Sigma^{1/2}$ is the left Cholesky
factor of $\Sigma$. Then the linear predictor $\eta_{ij} = Z_{ij}\beta + W_{ij}u_i$ can be transformed to
$\eta_{ij} = Z_{ij}\beta + W_{ij} \Sigma^{1/2} a_i$, where $a_i$ has the multivariate standard normal distribution.


Using some matrix algebra, the more common linear form can be obtained

\[ \eta_{ij} = Z_{ij}\beta + W_{ij}^{*}\theta, \tag{3.24} \]

where $W_{ij}^{*} = a_i' \otimes W_{ij}$, $\theta = \mathrm{vech}(\Sigma^{1/2})$, and $\otimes$ denotes the Kronecker product. Note
that $\mathrm{vech}(M)$ denotes a column vector with the unique stacked rows of the upper
triangle of $M$. That is, for example,

\[ \mathrm{vech}(M) = \mathrm{vech}\begin{pmatrix} m_{11} & m_{12} \\ m_{12} & m_{22} \end{pmatrix} = \begin{pmatrix} m_{11} \\ m_{12} \\ m_{22} \end{pmatrix}. \]

Maximum likelihood estimates are then obtained for the fixed effects parameter vector
$\beta$ and the unique elements $\theta$ of the Cholesky factor of $\Sigma$. A consequence of this
approach is that the weight function becomes the multivariate standard normal
distribution. Thus the univariate Gauss-Hermite nodes no longer need to be transformed
as in (3.22). Another advantage is that estimation of the Cholesky elements $\theta$ instead
of the covariance elements of $\Sigma$ is often more stable when the true variance elements
are near zero. A disadvantage is that it is often difficult, and sometimes impossible,
to determine the relationship between the parameters in $\Sigma$ and the parameters in $\theta$
when constraints are placed on the elements of $\Sigma$.

Tutz and Hennevogl (1996) considered an indirect maximization approach for
fitting their general ordinal random effects model. Using the transformed linear
predictor (3.24), they outlined both a multivariate Gauss-Hermite EM algorithm and a
Monte Carlo EM algorithm. In both algorithms they replaced the intractable
integrals in the E-step by numerical approximations. Due to the transformation (3.24),
the Monte Carlo approximation required sampling from the multivariate standard
normal distribution. An important issue in both Monte Carlo integration and
Gauss-Hermite quadrature is determining the number of samples, in the former, or nodes,
in the latter, that are required to adequately approximate the integrals. For models
with a single random effect, Tutz and Hennevogl (1996) suggested that only 10 to
20 samples were needed for the Monte Carlo integration and only 8 to 10 quadrature
points for Gauss-Hermite integration. When they considered a four-dimensional
random effect, they only recommended Monte Carlo integration with 20 to 30
samples as they reported extremely high computational time for the Gauss-Hermite EM
algorithm. For their Gauss-Hermite Fisher scoring algorithm, Hedeker and Gibbons
(1994) suggested that the number of quadrature points could actually be decreased
as the number of random effects was increased. For example, they suggested that as
few as three points per dimension for a five-dimensional random effects model would
be adequate for approximating the integrals. In our experience and that of others
(Pinheiro and Bates 1995; Agresti and Hartzel 1999), these recommendations will not
provide adequate approximations of the integrals. For the nonlinear mixed effects
logistic model with a single random effect, Pinheiro and Bates (1995) reported that
over 100 quadrature points were required to obtain the correct maximum likelihood
estimates. Agresti and Hartzel (1999) reported similar results for the logistic-normal
model. In Section 3.6 we illustrate these findings for models with ordinal responses.
The accuracy of using only 20 to 30 Monte Carlo samples is also questionable, since
no attempt was made to assess the Monte Carlo error in the approximations.

We now propose two algorithms for approximating and maximizing the marginal
likelihood (3.19) of the general multinomial random effects model. The first method
utilizes adaptive Gauss-Hermite quadrature (Liu and Pierce 1994; Pinheiro and Bates
1995) to numerically approximate the intractable integrals, and then directly
maximizes the marginal likelihood with a quasi-Newton algorithm. Thus we follow the
direct maximization approach of Hedeker and Gibbons (1994) and Hedeker (2000).
Note, however, that our algorithm differs from theirs in two
ways. First, we utilize adaptive quadrature which, as noted above, often requires
substantially fewer points for obtaining the same decimal accuracy as Gauss-Hermite
quadrature. Second, we utilize a quasi-Newton algorithm in place of their Fisher
scoring algorithm. As noted in Chapter 2, the main difference between the Fisher
scoring algorithm and the Newton-Raphson algorithm is that the expected
information matrix is used in the former, while the observed information matrix is used in the
latter. Thus the Fisher scoring algorithm has the advantage of using an information
matrix that is always positive definite. For random effects models, however, we feel
that the observed information matrix is easier to calculate. In general, the Fisher
scoring algorithm and the Newton-Raphson algorithm will perform similarly for a
given situation.

In our second proposed algorithm, a Monte Carlo EM algorithm is used to
indirectly maximize (3.19). Specifically, we apply the Monte Carlo EM algorithm of
Booth and Hobert (1999), which allows for the assessment of the Monte Carlo error
in the integral approximations. This approach is quite different from that of Tutz
and Hennevogl (1996), who did not evaluate the error in the Monte Carlo integration
and who fixed the number of Monte Carlo samples at the start of the
algorithm. In our approach, the sample size can be increased after each iteration of the
EM algorithm. Since the EM algorithm is inherently slow, we recommend its use in
situations where the integral dimensions are extremely high, for which the adaptive
quadrature algorithm becomes too computationally burdensome. For most
applications with low to moderate numbers of random effects, such as those considered in
Section 3.6, the adaptive quadrature algorithm will provide the most efficient means
for finding maximum likelihood estimates.




3.3.2  Quasi-Newton,  Adaptive  Gauss-Hermite  Quadrature  Algorithm 

Denoting the multivariate normal density by $g_{\mathrm{MVN}}(x; \mu, \Sigma)$, the $i$th set of integrals,
$i = 1, \ldots, n$, in (3.19) is

\[
L_i(\beta, \Sigma) = \int \cdots \int \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] g_{\mathrm{MVN}}(u_i; 0, \Sigma) \, du_i. \tag{3.25}
\]

The multivariate Gauss-Hermite approximation to (3.25) is given by (3.22) with $h(\cdot)$
replaced with $\prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i)$. Such an approximation centers and scales the original
Gauss-Hermite nodes to have the same mean and variance as the multivariate normal
density, $g_{\mathrm{MVN}}(u_i; 0, \Sigma)$. Liu and Pierce (1994), for approximating a single integral, and
Pinheiro and Bates (1995), for approximating multiple integrals in nonlinear mixed
effects models, discussed a modified Gauss-Hermite rule which Pinheiro and Bates
(1995) called adaptive Gauss-Hermite quadrature. Both Liu and Pierce (1994) and
Pinheiro and Bates (1995) noted that scaling the nodes about the mean and variance
of just $g_{\mathrm{MVN}}(u_i; 0, \Sigma)$ was inefficient as it failed to take into account the impact of $h(\cdot)$
on the shape of the integrals to be approximated. Instead, they recommended the
following approach.

Adaptive  Gauss-Hermite  quadrature 
Begin  by  letting 


\[
h(u_i) = \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] g_{\mathrm{MVN}}(u_i; 0, \Sigma), \tag{3.26}
\]

so that (3.25) becomes

\[
\int \cdots \int h(u_i) \, du_i. \tag{3.27}
\]


With adaptive Gauss-Hermite quadrature, one uses the mode $\mu_i^*$ of $h(u_i)$, and the
curvature $\Sigma_i^*$ around the mode of $h(u_i)$, to center and scale the original Gauss-Hermite
nodes. To utilize Gauss-Hermite quadrature, (3.27) must be written in the appropriate
form, which requires the weight function $\exp(-\sum_{r=1}^{m} u_{ir}^2)$. To this end (3.27) is
rewritten

\[
\int \cdots \int \frac{h(u_i)}{g_{\mathrm{MVN}}(u_i; \mu_i^*, \Sigma_i^*)} \, g_{\mathrm{MVN}}(u_i; \mu_i^*, \Sigma_i^*) \, du_i, \tag{3.28}
\]

where $g_{\mathrm{MVN}}(u_i; \mu_i^*, \Sigma_i^*)$ is the multivariate normal density with mean $\mu_i^*$ and covariance
matrix $\Sigma_i^*$. The multivariate Gauss-Hermite rule (3.22) can now be applied to (3.28)
with $h(\cdot)$ in (3.22) replaced with

\[
\frac{h(u_i)}{g_{\mathrm{MVN}}(u_i; \mu_i^*, \Sigma_i^*)}.
\]

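The adaptive Gauss-Hermite quadrature approximation of (3.25) is then given by

\[
\begin{aligned}
L_i(\beta, \Sigma) &= \int \cdots \int \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] g_{\mathrm{MVN}}(u_i; 0, \Sigma) \, du_i \\
&\approx |\Sigma_i^*|^{1/2} \, 2^{m/2} \sum_{l_1=1}^{K} \cdots \sum_{l_m=1}^{K} \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; \varsigma^*_{i\mathbf{l}}) \right] g_{\mathrm{MVN}}(\varsigma^*_{i\mathbf{l}}; 0, \Sigma) \, w_{\mathbf{l}} \exp(\varsigma_{\mathbf{l}}' \varsigma_{\mathbf{l}}),
\end{aligned} \tag{3.29}
\]

where $w_{\mathbf{l}} = \prod_{r=1}^{m} w_{l_r}$, $\varsigma^*_{i\mathbf{l}} = \sqrt{2} \, \Sigma_i^{*1/2} \varsigma_{\mathbf{l}} + \mu_i^*$, $\varsigma_{\mathbf{l}} = (\varsigma_{l_1}, \ldots, \varsigma_{l_m})'$, and $\{w_{l_r}\}$ and
$\{\varsigma_{l_r}\}$, $r = 1, \ldots, m$, are the weights and nodes from the univariate Gauss-Hermite
rule. Adaptive Gauss-Hermite quadrature can be viewed as a deterministic version of
Monte Carlo integration in which the random samples generated from $g_{\mathrm{MVN}}(u_i; \mu_i^*, \Sigma_i^*)$
are replaced with the fixed values $\{\varsigma^*_{i\mathbf{l}}\}$.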

The use of adaptive Gauss-Hermite quadrature can substantially decrease the
number of nodes needed to obtain adequate approximations of the intractable
integrals (Pinheiro and Bates 1995; Agresti and Hartzel 1999). For a logistic-normal
model with a random intercept, Agresti and Hartzel (1999) needed only 9 quadrature
points to obtain convergence to four decimal places with adaptive Gauss-Hermite
quadrature, as opposed to about 200 for standard Gauss-Hermite quadrature. Similar
results are shown in Section 3.6. Adaptive Gauss-Hermite quadrature is computationally
more complex than standard Gauss-Hermite quadrature, however. For each
subject $i$, the mode $\mu_i^*$ of $h(u_i)$ in (3.26), and the curvature $\Sigma_i^*$ at the mode of $h(u_i)$
must be calculated. These estimates, based on the current estimates of $\beta$ and $\operatorname{vech}(\Sigma)$,
must be recalculated for each iteration of the maximization routine. The mode $\mu_i^*$
can be found using any maximization routine. An estimate of the curvature around
the mode can be obtained by inverting the negative of the second derivative matrix
of $\log h(u_i)$ evaluated at the estimated mode. We use numerical second derivatives to
obtain $\Sigma_i^*$.
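
The difference between the two rules can be illustrated with a minimal Python sketch for the simplest case, a logistic-normal model with a single random intercept (m = 1). It is not the Ox implementation used in this work: the function names, the data vector y, and the values beta0 = 0 and sigma = 1 are assumptions made only for illustration, and the mode is found by Newton-Raphson with an analytical second derivative rather than a numerical one.

```python
import numpy as np

def lik(u, y, beta0):
    """Conditional likelihood f(y_i | beta; u) for Bernoulli data with logit(p) = beta0 + u."""
    eta = beta0 + np.asarray(u, dtype=float)
    return np.exp(np.sum(y) * eta - len(y) * np.log1p(np.exp(eta)))

def normal_pdf(u, sd):
    """Univariate N(0, sd^2) density, the m = 1 case of g_MVN(u; 0, Sigma)."""
    return np.exp(-0.5 * (u / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def standard_gh(y, beta0, sigma, K):
    """Nodes centered and scaled by the random effects density only."""
    z, w = np.polynomial.hermite.hermgauss(K)
    return np.sum(w * lik(np.sqrt(2.0) * sigma * z, y, beta0)) / np.sqrt(np.pi)

def adaptive_gh(y, beta0, sigma, K):
    """Nodes centered at the mode of h(u) and scaled by the curvature of log h(u)."""
    T, u = len(y), 0.0
    for _ in range(100):  # Newton-Raphson search for the mode of log h(u)
        p = 1.0 / (1.0 + np.exp(-(beta0 + u)))
        grad = np.sum(y) - T * p - u / sigma**2
        hess = -T * p * (1.0 - p) - 1.0 / sigma**2
        if abs(grad) < 1e-12:
            break
        u -= grad / hess
    sd_star = np.sqrt(-1.0 / hess)  # square root of the scalar curvature
    z, w = np.polynomial.hermite.hermgauss(K)
    nodes = u + np.sqrt(2.0) * sd_star * z  # centered and scaled nodes
    h = lik(nodes, y, beta0) * normal_pdf(nodes, sigma)
    # m = 1 version of (3.29): |Sigma*|^(1/2) 2^(1/2) sum_l h(node_l) w_l exp(z_l^2)
    return np.sqrt(2.0) * sd_star * np.sum(w * np.exp(z**2) * h)

y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])  # hypothetical responses for one subject
for K in (3, 5, 9):
    print(K, standard_gh(y, 0.0, 1.0, K), adaptive_gh(y, 0.0, 1.0, K))
```

Printing both approximations for increasing K shows how quickly each rule settles for this small example.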

Quasi-Newton  algorithm 

We incorporate the adaptive Gauss-Hermite approximation into a quasi-Newton
algorithm to directly maximize the log of (3.19) for the general multinomial random
effects model. Specifically, we utilize the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) algorithm, which was proposed independently by Broyden (1970), Fletcher
(1970), Goldfarb (1970), and Shanno (1970). The BFGS algorithm was designed for
maximizing scalar objective functions and so is well suited for maximum likelihood
estimation. Denoting the complete vector of parameters by $\Psi' = (\beta', \operatorname{vech}(\Sigma)')$, the
BFGS quasi-Newton algorithm is defined at the $(s+1)$th iteration by

\[
\Psi^{(s+1)} = \Psi^{(s)} + \delta^{(s)} \, H^{(s)-1} g^{(s)}, \tag{3.30}
\]

where $\Psi^{(s)}$ is the estimate of $\Psi$ at the previous iteration, $\delta^{(s)}$ is a step length between
zero and one, and $g^{(s)}$ and $H^{(s)}$ denote the gradient vector and the negative of the
Hessian matrix of the log of (3.19), respectively. For our application of the BFGS
algorithm, we compute $g$ from analytical first derivatives of the log of (3.19) with respect
to $\Psi$, which are given below. Calculation of the Hessian matrix is not required, as
the BFGS algorithm updates an approximation to it automatically. The step length $\delta$
is normally set to one. However, if no improvement is made in the log likelihood value
with $\delta = 1$, a $\delta$ that improves the value is found by a line search. Convergence is based
on both the change in the parameter estimates and the change in the gradient vectors.
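
The update (3.30) can be illustrated with a minimal Python sketch. It is a generic BFGS maximizer, not the Ox routine used in this work. The function name bfgs_maximize, the tolerances, the step-halving line search, and the concave quadratic used as a stand-in for the log likelihood are all assumptions made for illustration.

```python
import numpy as np

def bfgs_maximize(loglik, grad, psi0, tol=1e-8, max_iter=200):
    """Maximize loglik by BFGS; B approximates the inverse of the negative Hessian."""
    psi = np.asarray(psi0, dtype=float)
    B = np.eye(psi.size)
    g = grad(psi)
    for _ in range(max_iter):
        direction = B @ g                      # ascent direction, as in (3.30)
        delta, f_old = 1.0, loglik(psi)
        # Start from delta = 1 and halve the step until the log likelihood improves.
        while loglik(psi + delta * direction) <= f_old and delta > 1e-10:
            delta *= 0.5
        step = delta * direction
        psi_new = psi + step
        g_new = grad(psi_new)
        yk = g - g_new                         # gradient change of the negated objective
        sy = step @ yk
        if sy > 1e-12:                         # curvature condition keeps B positive definite
            rho, eye = 1.0 / sy, np.eye(psi.size)
            B = ((eye - rho * np.outer(step, yk)) @ B @ (eye - rho * np.outer(yk, step))
                 + rho * np.outer(step, step))
        # Convergence uses both the parameter change and the gradient.
        converged = np.max(np.abs(step)) < tol and np.max(np.abs(g_new)) < tol
        psi, g = psi_new, g_new
        if converged:
            break
    return psi

# Hypothetical concave objective with known maximizer a.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
a = np.array([1.0, -2.0])
psi_hat = bfgs_maximize(lambda p: -0.5 * (p - a) @ A @ (p - a),
                        lambda p: A @ (a - p),
                        np.zeros(2))
```

Only gradient evaluations are needed; the matrix B is built up from successive changes in the parameters and the gradient.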




We choose to provide analytical first derivatives for the gradient function $g$ instead
of using numerical derivatives. Though the analytical derivatives require additional
calculation, they provide increased accuracy over numerical derivatives and are needed
for calculation of the observed information matrix upon convergence of the algorithm.
Denote the log of (3.19) by $l(\beta, \Sigma)$ and define $L_i(\beta, \Sigma)$ as in (3.25). We first consider
the derivative of $l(\beta, \Sigma)$ with respect to $\beta$. Interchanging integrals and derivatives
and using the identity

\[
\frac{\partial}{\partial \beta} \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] = \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] \left[ \sum_{j=1}^{T_i} \frac{\partial \log f(y_{ij} \mid \beta; u_i)}{\partial \beta} \right], \tag{3.31}
\]


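one obtains

\[
\frac{\partial l(\beta, \Sigma)}{\partial \beta} = \sum_{i=1}^{n} \frac{1}{L_i(\beta, \Sigma)} \times \left\{ \int \cdots \int \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] g_{\mathrm{MVN}}(u_i; 0, \Sigma) \left[ \sum_{j=1}^{T_i} \frac{\partial \log f(y_{ij} \mid \beta; u_i)}{\partial \beta} \right] du_i \right\}. \tag{3.32}
\]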


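Denoting the approximation of $L_i(\beta, \Sigma)$ in (3.29) by $\tilde{L}_i$ and using the centered and
scaled nodes $\{\varsigma^*_{i\mathbf{l}}\}$ from (3.29), the derivative in (3.32) can be approximated by

\[
\frac{\partial l(\beta, \Sigma)}{\partial \beta} \approx \sum_{i=1}^{n} \sum_{j=1}^{T_i} \sum_{l_1=1}^{K} \cdots \sum_{l_m=1}^{K} c_{i\mathbf{l}} \, \frac{\partial \log f(y_{ij} \mid \beta; \varsigma^*_{i\mathbf{l}})}{\partial \beta}, \tag{3.33}
\]

where

\[
c_{i\mathbf{l}} = \tilde{L}_i^{-1} \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; \varsigma^*_{i\mathbf{l}}) \right] g_{\mathrm{MVN}}(\varsigma^*_{i\mathbf{l}}; 0, \Sigma) \, |\Sigma_i^*|^{1/2} \, 2^{m/2} \, w_{\mathbf{l}} \exp(\varsigma_{\mathbf{l}}' \varsigma_{\mathbf{l}}). \tag{3.34}
\]

Note that $c_{i\mathbf{l}}$ in (3.34) evaluates to a scalar. Then since $f(y_{ij} \mid \beta; \varsigma^*_{i\mathbf{l}})$ is a multivariate
generalized linear model, (3.33) can be viewed as a weighted score function with
weights $\{c_{i\mathbf{l}}\}$. Using the form of a weighted score function (Tutz and Hennevogl
1996), the approximate derivative (3.33) can be calculated from

\[
\frac{\partial l(\beta, \Sigma)}{\partial \beta} \approx \sum_{i=1}^{n} \sum_{l_1=1}^{K} \cdots \sum_{l_m=1}^{K} c_{i\mathbf{l}} \sum_{j=1}^{T_i} Z_{ij}' D_{ij} R_{\pi_{ij}}^{-1} (y_{ij} - \pi_{ij}), \tag{3.35}
\]

where $D_{ij}$, $R_{\pi_{ij}}$, and $Z_{ij}$ depend on the multinomial link and model, and are given
in Chapter 2. Note that (3.35) is evaluated by plugging in the current estimates of
$\beta$ and $\Sigma$.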

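We proceed now with the derivative of $l(\beta, \Sigma)$ with respect to $\operatorname{vech}(\Sigma)$, the unique
elements of $\Sigma$. Again interchanging integrals and derivatives and using a similar
identity to (3.31),

\[
\frac{\partial l(\beta, \Sigma)}{\partial \operatorname{vech}(\Sigma)} = \sum_{i=1}^{n} \frac{1}{L_i(\beta, \Sigma)} \left\{ \int \cdots \int \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \right] g_{\mathrm{MVN}}(u_i; 0, \Sigma) \, \frac{\partial \log g_{\mathrm{MVN}}(u_i; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma)} \, du_i \right\}. \tag{3.36}
\]

Using the same approach as for (3.33), the derivative (3.36) can be approximated by

\[
\frac{\partial l(\beta, \Sigma)}{\partial \operatorname{vech}(\Sigma)} \approx \sum_{i=1}^{n} \sum_{l_1=1}^{K} \cdots \sum_{l_m=1}^{K} c_{i\mathbf{l}} \, \frac{\partial \log g_{\mathrm{MVN}}(\varsigma^*_{i\mathbf{l}}; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma)}, \tag{3.37}
\]

with $c_{i\mathbf{l}}$ given in (3.34). By substituting into (3.37) the first derivative of the log of
a multivariate normal density with respect to the unique elements of $\Sigma$, one obtains
the approximation,

\[
\frac{\partial l(\beta, \Sigma)}{\partial \operatorname{vech}(\Sigma)} \approx \sum_{i=1}^{n} \sum_{l_1=1}^{K} \cdots \sum_{l_m=1}^{K} c_{i\mathbf{l}} \left[ -\Sigma^{-1} + \tfrac{1}{2} \operatorname{diag}(\Sigma^{-1}) + \Sigma^{-1} \varsigma^*_{i\mathbf{l}} \varsigma^{*\prime}_{i\mathbf{l}} \Sigma^{-1} - \tfrac{1}{2} \operatorname{diag}(\Sigma^{-1} \varsigma^*_{i\mathbf{l}} \varsigma^{*\prime}_{i\mathbf{l}} \Sigma^{-1}) \right], \tag{3.38}
\]

where $\operatorname{diag}(M)$ is the matrix $M$ with all off-diagonal elements set to zero. Evaluation
of (3.38) is accomplished by plugging in the current estimates of $\beta$ and $\Sigma$.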




Programming 

The matrix programming language Ox (Doornik 1998) was used to program the
quasi-Newton, adaptive Gauss-Hermite quadrature algorithm. Ox was designed for
programming in matrices and computing matrix calculations, and thus is especially
suited for statistical programming. To take advantage of Ox's matrix capabilities,
and to avoid the multiple summations found in, for example, (3.38), we expanded
the data vectors to match the total number of quadrature points $K^m$ (Hinde 1982).
Thus for observation $j$ of subject $i$, $y^*_{ij} = \mathbf{1}_{K^m} \otimes y_{ij}$ and $x^*_{ij} = \mathbf{1}_{K^m} \otimes x_{ij}$, where $\mathbf{1}_{K^m}$
is a $K^m$ by one column vector of ones. One can then create expanded design matrices
$Z^*_{ij}$ and define the random effects design matrix to be $W^*_{ij} = [\varsigma^*_{i\mathbf{l}}]$, which contains the
$K^m$ stacked row vectors of the scaled and centered nodes for the $i$th subject. One
must be careful, however, as the matrices can become very large. For example, the
dimension of $Z^*_{ij}$ would be $(R - 1) K^m$ by $(R - 1) + p$.
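
The expansion described above can be sketched in a few lines of Python with NumPy (the original program was written in Ox, so this is only an illustration). The values of K and m, the mode mu_star, the curvature sigma_star, and the response vector y_ij are hypothetical, and the Cholesky factor is used as one possible matrix square root.

```python
import numpy as np
from itertools import product

K, m = 3, 2                                   # nodes per dimension, number of random effects
z, w = np.polynomial.hermite.hermgauss(K)     # univariate Gauss-Hermite nodes and weights

# All K^m combinations of univariate nodes, and the matching product weights.
nodes = np.array(list(product(z, repeat=m)))                  # shape (K^m, m)
weights = np.array(list(product(w, repeat=m))).prod(axis=1)   # shape (K^m,)

# Hypothetical mode and curvature for one subject (illustrative values only).
mu_star = np.array([0.3, -0.1])
sigma_star = np.array([[0.5, 0.1], [0.1, 0.4]])
root = np.linalg.cholesky(sigma_star)         # one choice of matrix square root
W_star = mu_star + np.sqrt(2.0) * nodes @ root.T   # stacked centered and scaled nodes

# Expand a response vector with R - 1 = 2 elements to all K^m nodes.
y_ij = np.array([[0.0], [1.0]])
y_star = np.kron(np.ones((K**m, 1)), y_ij)    # shape ((R - 1) * K^m, 1)
```

With K = 15 and m = 4 the same construction would already produce 50,625 rows per response element, which illustrates why the expanded matrices can become very large.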

Naive starting values for the fixed effects parameters can be obtained from fitting
the model without the random effects. For the covariance matrix $\Sigma$, one can start with
the identity matrix as a naive initial estimate. Alternatively, one could use the final
estimates from an approximate maximum likelihood algorithm, as will be described
in Section 3.5. These estimates are usually close to the true maximum likelihood
estimates, and thus reduce the time required for the algorithm to converge. We have also
had success using initial estimates obtained from fitting the quasi-Newton algorithm
with a smaller number, say five in each dimension, of quadrature points. This is often
useful for fitting a model with a large number of random effects, where starting from
naive estimates with a large number of quadrature points in each dimension would
require many more likelihood evaluations to obtain the maximum. In terms of the
number of quadrature points to use, there is no golden number that will ensure
adequate approximation of the integrals for every dataset. We recommend upwards of
15 per dimension. However, one should always try additional runs beyond the chosen
number to ensure the estimates have stabilized. From our experience, higher numbers
of quadrature points are needed to obtain convergence in the standard errors than
in the parameter estimates, as was also noted for binary random effects models by
Agresti and Hartzel (1999).

3.3.3  Monte  Carlo  EM  Algorithm 

The algorithm of the previous section is very efficient when the number of random
effects is less than five or six. Beyond this point, however, the computational burden
of evaluating $K^m$ points for each observation within each iteration becomes too great.
Alternatively, one can use Monte Carlo methods to approximate the intractable
integrals. Monte Carlo methods do not experience the curse of dimensionality problem
that plagues Gauss-Hermite integration, as the number of samples $K$ does not
increase exponentially with the number of random effects (we note that the number of
samples needed to approximate integrals does increase with $m$, but generally not at
an exponential rate). Monte Carlo methods can also be formulated so that one can
assess the error in the integral approximations (Booth and Hobert 1999). The Monte
Carlo EM algorithm proposed by Tutz and Hennevogl (1996) lacked any assessment
of Monte Carlo error. This, compounded with the few samples drawn to approximate
each integral (maximum of 30), makes estimates from that algorithm suspect.

Booth  and  Hobert  (1999)  proposed  an  automated  Monte  Carlo  EM  algorithm 
for  fitting  generalized  linear  mixed  models.  The  algorithm  is  “automated”  in  the 
sense  that  at  each  iteration,  the  Monte  Carlo  error  is  assessed  and  the  number  of 
samples  is  increased  if  the  change  in  parameter  estimates  from  the  previous  iteration 
is  “swamped”  with  Monte  Carlo  error.  To  make  this  possible,  independently  and 
identically  distributed  random  samples  are  generated  at  each  iteration,  allowing  one 
to  use  standard  central  limit  theory  to  assess  the  Monte  Carlo  error.  We  now  extend 
their  algorithm  to  the  multivariate  generalized  linear  mixed  models  considered  here. 




The EM algorithm (Dempster et al. 1977) is an iterative method for obtaining
maximum likelihood estimators in situations with incomplete data. For multivariate
generalized linear mixed models, the random effects $u = (u_1', \ldots, u_n')'$ are treated as
the missing data. The EM algorithm is defined in terms of the complete log-likelihood

\[
l_c(\Psi) = \log f\{y, u; \Psi\} = \sum_{i=1}^{n} \left[ \log f(y_i \mid \beta; u_i) + \log g_{\mathrm{MVN}}(u_i; 0, \Sigma) \right], \tag{3.39}
\]

where $\Psi' = (\beta', \operatorname{vech}(\Sigma)')$ and, for convenience, $f(y_i \mid \beta; u_i) = \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i)$.
The algorithm is divided into an Expectation Step (E-step) and a Maximization Step
(M-step).

E-step 

We begin by considering the E-step at the $(s+1)$th iteration. In the E-step, the
expectation of the complete log-likelihood (3.39) is determined with respect to the
conditional distribution $h(u \mid \Psi^{(s)}; y)$. That is,

\[
\begin{aligned}
Q(\Psi \mid \Psi^{(s)}) &= E\{\log f(y, u; \Psi) \mid \Psi^{(s)}; y\} \\
&= \sum_{i=1}^{n} \int \cdots \int \left[ \log f(y_i \mid \beta; u_i) + \log g_{\mathrm{MVN}}(u_i; 0, \Sigma) \right] h(u_i \mid \Psi^{(s)}; y_i) \, du_i.
\end{aligned} \tag{3.40}
\]

The expectation in (3.40) cannot be obtained in closed form since the conditional
distribution $h(u_i \mid \Psi^{(s)}; y_i)$ contains the multivariate normal density. However, one
can use Monte Carlo methods to approximate this expectation by generating samples
from $h(u_i \mid \Psi^{(s)}; y_i)$. For the Monte Carlo EM algorithm proposed by Booth and
Hobert (1999), the generated samples used to approximate (3.40) must be independently
and identically distributed. They suggested either importance sampling or
rejection sampling to generate such samples, of which we utilize the latter approach
for our algorithm.

The following multivariate rejection procedure (Geweke 1996) can be used to select
$K$ random samples $\varsigma_{il}^{(s)}$, $l = 1, \ldots, K$, from $h(u_i \mid \Psi^{(s)}; y_i)$:

1. Sample $\varsigma_{il}^{(s)}$ from $g_{\mathrm{MVN}}(u_i; 0, \Sigma^{(s)})$ and independently sample $w$ from the
uniform(0,1) distribution.

2. Accept $\varsigma_{il}^{(s)}$ if $w < f(y_i \mid \beta^{(s)}; \varsigma_{il}^{(s)}) / \tau$, otherwise go to 1,

where $\tau = \sup_u f(y_i \mid \beta^{(s)}; u_i)$. To calculate $\tau$, note that $f(y_i \mid \beta^{(s)}; u_i)$ can be
considered a multivariate generalized linear model with unknown parameter vector
$u_i$. Thus one can find the maximum likelihood estimate of $u_i$, say $\hat{u}_i$, by fitting a
multivariate generalized linear model that includes an offset of $Z_{ij} \beta^{(s)}$. The value of
$\tau$ is then given by $\tau = f(y_i \mid \beta^{(s)}; \hat{u}_i)$.
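
The two steps above can be sketched in Python for a deliberately simple special case: binary responses with a single random intercept and no covariates, so that the maximizer of the conditional likelihood is available in closed form and no model fit with an offset is needed. The function names, the data, and the values of beta0 and sigma are assumptions made for illustration only.

```python
import math
import random

def lik(u, y, beta0):
    """f(y_i | beta; u) for Bernoulli responses with logit(p) = beta0 + u."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + u)))
    return math.prod(p if yj == 1 else 1.0 - p for yj in y)

def rejection_sample(y, beta0, sigma, K, rng):
    """Draw K i.i.d. values from h(u | y) using candidates from N(0, sigma^2)."""
    ybar = sum(y) / len(y)                         # assumes 0 < ybar < 1
    u_hat = math.log(ybar / (1.0 - ybar)) - beta0  # closed-form maximizer of the likelihood
    tau = lik(u_hat, y, beta0)                     # supremum of the conditional likelihood
    samples, tries = [], 0
    while len(samples) < K:
        tries += 1
        candidate = rng.gauss(0.0, sigma)          # step 1: candidate from the random effects density
        if rng.random() < lik(candidate, y, beta0) / tau:
            samples.append(candidate)              # step 2: accept, otherwise draw again
    return samples, tries

rng = random.Random(1)
y = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]                 # hypothetical responses for one subject
samples, tries = rejection_sample(y, 0.0, 1.0, 500, rng)
```

The ratio of accepted draws to tries gives a rough idea of the cost of the E-step for a subject; it falls as the conditional likelihood becomes more concentrated relative to the random effects density.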

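Given $n$ sets of $K$ multivariate samples from $h(u_i \mid \Psi^{(s)}; y_i)$, $i = 1, \ldots, n$, a $K$
multivariate sample approximation to $Q(\Psi \mid \Psi^{(s)})$ is given by

\[
Q_K(\Psi \mid \Psi^{(s)}) = \frac{1}{K} \sum_{i=1}^{n} \sum_{l=1}^{K} \left[ \log f(y_i \mid \beta; \varsigma_{il}^{(s)}) + \log g_{\mathrm{MVN}}(\varsigma_{il}^{(s)}; 0, \Sigma) \right]. \tag{3.41}
\]

We will consider the assessment of the Monte Carlo error in (3.41) below.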

M-step 

The M-step at the $(s+1)$th iteration consists of maximizing the Monte Carlo
approximation $Q_K(\Psi \mid \Psi^{(s)})$ in (3.41) with respect to $\Psi$. Since the elements $\beta$ and
$\Sigma$ of $\Psi$ occur separately in the two terms of (3.41), the terms can be maximized
individually. Consider the first term

\[
\frac{1}{K} \sum_{i=1}^{n} \sum_{l=1}^{K} \log f(y_i \mid \beta; \varsigma_{il}^{(s)}), \tag{3.42}
\]

and note that (3.42) is a multivariate generalized linear model. By replicating the
data vectors $y_{ij}$ and $x_{ij}$ $K$ times, the linear predictor for the $l$th multivariate sample
of the $j$th observation on the $i$th subject can be written

\[
\eta_{ijl} = Z_{ijl} \beta + W_{ijl} \varsigma_{il}^{(s)}.
\]

Maximization with respect to $\beta$ can be carried out by Fisher scoring with $W_{ijl} \varsigma_{il}^{(s)}$ as
an offset term. The same Fisher scoring algorithm given in Section 2.3 applies here,
with an additional summation over $l$ in (2.6) and (2.7).

The second term to be maximized in (3.41) is given by

\[
\frac{1}{K} \sum_{i=1}^{n} \sum_{l=1}^{K} \log g_{\mathrm{MVN}}(\varsigma_{il}^{(s)}; 0, \Sigma), \tag{3.43}
\]

which is just a multivariate normal log-likelihood. For unstructured covariance
matrices or when the random effects are assumed independent, closed form solutions
for the maximum likelihood estimate of $\Sigma$ exist. Generally, one can use an iterative
procedure to find the maximum of (3.43) with respect to $\Sigma$ (see, e.g., Searle et al.
1992, Chap. 11).
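
For an unstructured covariance matrix, the closed form maximizer of (3.43) is the average outer product of the sampled random effects, since the mean is fixed at zero. A minimal Python sketch follows; the function name update_sigma and the array layout (subjects by samples by random effects) are assumptions for illustration, and the independent case simply keeps the diagonal of the same matrix.

```python
import numpy as np

def update_sigma(samples, independent=False):
    """Closed form maximizer of (3.43); samples has shape (n, K, m)."""
    n, K, m = samples.shape
    flat = samples.reshape(n * K, m)          # stack all n * K sampled vectors
    sigma = flat.T @ flat / (n * K)           # zero-mean MLE: average outer product
    if independent:
        sigma = np.diag(np.diag(sigma))       # keep variances only
    return sigma

# Tiny hypothetical example: n = 2 subjects, K = 2 samples, m = 2 random effects.
draws = np.array([[[1.0, 0.0], [0.0, 2.0]],
                  [[-1.0, 0.0], [0.0, -2.0]]])
sigma_new = update_sigma(draws)
```

The factor 1/K in (3.43) does not affect the location of the maximum, so it plays no role in this update.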

Monte Carlo error of $\Psi^{(s+1)}$

The approximation of $Q(\Psi \mid \Psi^{(s)})$ in (3.41) inevitably will contain Monte Carlo
error. This error is propagated through to the M-step, resulting in incorrect estimates
for $\beta$ and $\Sigma$. The Monte Carlo error can be made smaller by approximating
$Q(\Psi \mid \Psi^{(s)})$ with larger multivariate samples $K$. However, the larger $K$ is chosen, the longer
the estimation routine will take to converge. Booth and Hobert (1999) proposed a
method for evaluating the Monte Carlo error in the estimates of $\beta$ and $\Sigma$. Using the
error estimate, they proposed a way to automate the choice of $K$ at each iteration.
Thus, they could start with a smaller $K$, when the parameter estimates were far from
the maximum likelihood estimates, and increase $K$ as they neared the maximum
likelihood estimates. In addition, after taking into account the Monte Carlo error
in the parameter estimates, they could accurately evaluate the convergence of the
parameter estimates.

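Following Booth and Hobert (1999), define $Q^{(j)}(\Psi \mid \Psi^{(s)}) = \frac{\partial^j}{\partial \Psi^j} Q(\Psi \mid \Psi^{(s)})$ for
$j = 1, 2$ and define $Q_K^{(j)}$ similarly. Booth and Hobert (1999) showed that conditional
on $\Psi^{(s)}$, $\Psi^{(s+1)}$ is approximately normal with mean $\Psi^{*(s+1)}$ and variance

\[
\operatorname{var}(\Psi^{(s+1)} \mid \Psi^{(s)}) \approx Q_K^{(2)}(\Psi^{*(s+1)} \mid \Psi^{(s)})^{-1} \operatorname{var}\left\{ Q_K^{(1)}(\Psi^{*(s+1)} \mid \Psi^{(s)}) \right\} Q_K^{(2)}(\Psi^{*(s+1)} \mid \Psi^{(s)})^{-1}, \tag{3.44}
\]

where $\Psi^{*(s+1)}$ satisfies $Q^{(1)}(\Psi^{*(s+1)} \mid \Psi^{(s)}) = 0$. An estimate of (3.44) is obtained by
substituting $\Psi^{(s+1)}$ for $\Psi^{*(s+1)}$, and estimating $\operatorname{var}\{Q_K^{(1)}(\Psi^{*(s+1)} \mid \Psi^{(s)})\}$ with

\[
\widehat{\operatorname{var}}\left\{ Q_K^{(1)}(\Psi^{(s+1)} \mid \Psi^{(s)}) \right\} = \frac{1}{K^2} \sum_{l=1}^{K} \left\{ \frac{\partial}{\partial \Psi} \log f(y, \varsigma_l^{(s)}; \Psi^{(s+1)}) \right\} \left\{ \frac{\partial}{\partial \Psi} \log f(y, \varsigma_l^{(s)}; \Psi^{(s+1)}) \right\}', \tag{3.45}
\]

where $\varsigma_l^{(s)} = (\varsigma_{1l}^{(s)}, \ldots, \varsigma_{nl}^{(s)})$, $l = 1, \ldots, K$, are the $K$ sets of $n$ multivariate random
samples for the $s$th iteration. For the multinomial random effects models, the
elements of (3.44) and (3.45) have the forms given below.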

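We first consider $\frac{\partial}{\partial \Psi} \log f(y, \varsigma_l^{(s)}; \Psi^{(s+1)})$, which is the first derivative of the
complete log-likelihood (3.39) with respect to $\Psi$, evaluated at $\Psi^{(s+1)}$ and the random
sample $\varsigma_l^{(s)}$. Only the first term in (3.39) contains $\beta$, and it has the form of a
multivariate generalized linear model. Thus the derivative of $\log f(y, \varsigma_l^{(s)}; \Psi^{(s+1)})$
with respect to $\beta$ is just the score function:

\[
\frac{\partial}{\partial \beta} \log f(y, \varsigma_l^{(s)}; \Psi^{(s+1)}) = \sum_{i=1}^{n} \sum_{j=1}^{T_i} Z_{ijl}' D_{ijl} R_{\pi_{ijl}}^{-1} (y_{ij} - \pi_{ijl}), \tag{3.46}
\]

where $D_{ijl}$, $R_{\pi_{ijl}}$, and $\pi_{ijl}$ are calculated using $\beta^{(s+1)}$ and $\varsigma_{il}^{(s)}$. The second term in
(3.39) is the log of a multivariate normal density, which has a derivative with respect
to $\operatorname{vech}(\Sigma)$ of

\[
\frac{\partial}{\partial \operatorname{vech}(\Sigma)} \log f(y, \varsigma_l^{(s)}; \Psi^{(s+1)}) = \sum_{i=1}^{n} \left[ -\Sigma^{(s+1)-1} + \tfrac{1}{2} \operatorname{diag}(\Sigma^{(s+1)-1}) + \Sigma^{(s+1)-1} \varsigma_{il}^{(s)} \varsigma_{il}^{(s)\prime} \Sigma^{(s+1)-1} - \tfrac{1}{2} \operatorname{diag}(\Sigma^{(s+1)-1} \varsigma_{il}^{(s)} \varsigma_{il}^{(s)\prime} \Sigma^{(s+1)-1}) \right]. \tag{3.47}
\]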


The  derivatives  (3.46)  and  (3.47)  are  stacked  into  a column  vector  and  used  in  (3.45). 

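To estimate (3.44), we also need to calculate $Q_K^{(2)}(\Psi^{(s+1)} \mid \Psi^{(s)})$, which corresponds
to the second derivative matrix of (3.41) with respect to $\Psi$ and has the form

\[
\frac{\partial^2}{\partial \Psi \, \partial \Psi'} Q_K(\Psi^{(s+1)} \mid \Psi^{(s)}) =
\begin{bmatrix}
\dfrac{\partial^2}{\partial \beta \, \partial \beta'} Q_K & 0 \\
0 & \dfrac{\partial^2}{\partial \operatorname{vech}(\Sigma) \, \partial \operatorname{vech}(\Sigma)'} Q_K
\end{bmatrix}. \tag{3.48}
\]

Note that the off-diagonal elements are zero since $\beta$ and $\Sigma$ do not occur together
in either the first or second term of (3.41). Replicating $y_{ij}$ and $x_{ij}$ $K$ times and
exploiting the fact that the first term is a multivariate generalized linear model,
the negative of the second derivative of $Q_K(\Psi^{(s+1)} \mid \Psi^{(s)})$ with respect to $\beta$ is just the
observed information matrix,

\[
-\frac{\partial^2}{\partial \beta \, \partial \beta'} Q_K = \frac{1}{K} \sum_{i=1}^{n} \sum_{l=1}^{K} \sum_{j=1}^{T_i} \left\{ Z_{ijl}' D_{ijl} R_{\pi_{ijl}}^{-1} D_{ijl}' Z_{ijl} - \sum_{r=1}^{q} Z_{ijl}' U_{ijlr} Z_{ijl} (y_{ijlr} - \pi_{ijlr}) \right\}, \tag{3.49}
\]

where $D_{ijl}$, $R_{\pi_{ijl}}$, $U_{ijlr}$, and $\pi_{ijlr}$ are computed using $\beta^{(s+1)}$ and $\varsigma_{il}^{(s)}$, and $U_{ijlr}$ is
defined as in Chapter 2.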

The second element in (3.48) is the second derivative of the log of a multivariate
normal density with respect to the unique elements of $\Sigma$. To provide an explicit
formula for calculating this derivative matrix, we first need the following notation.

Notation. (Graham 1981)

$E_{rt}$ : A $q$ by $q$ matrix of zeros with a one in the $(r, t)$th cell.

$I_q$ : A $q$ by $q$ identity matrix.

$J_q$ : A $q$ by $q$ matrix of ones.

$\delta_{rt} = 1$ if $r \neq t$, 0 otherwise.

U = £r=l  ELl  Ert  ® Est. 

U = EUi  ELi  En  <8.  Ert. 




U*  = U + U-  ZUi  Err  ® Err. 


We  also  denote  element-wise  multiplication  between  two  conformable  matrices  by  the 
symbol  •*.  Define  the  matrix  An  by 

9 9 


M = E El  [e«  ® (s(,+1)  *£*)]  ® [/,  ® (e<s+1)  1 4'}  4,}'  E<s+1> 

r=l  t=l  ^ 

t [^rt  <8>  (E(*+1)_1Etr)]  [/,  (8)  (e<s+1>-1  4s)  4'y  £(s+1)_1)] 

Iq  ® (e'*+1>_1  4s)  4s)'  E^1)"1)]  [i?rl  <8>  (EtE(s+1>-1)] 


+ <5; 


+ <Srt  [/,  (8)  (e<'+1>  1 4S) 4iY  E(8+1)  *)]  [^t®(^S(s+1)  1)]|.  (3.50) 


Then 


d 2 

dvech(E)  dvech(E)' 


n AT 


i=i  1=1 


< ^ E EU7*  ® s(s+1)_1 1 i7*  ® s(4+1)  1 


i 


[Jq  ® -7<j]  ‘ * 


7g  <8>  E(s+1>  1 


u*  l/0®  e(s+1) 


-i 


^4i(  T 2 [*^ij  ® 


) 


(3.51) 


Using (3.49) and (3.51) in (3.48), one obtains an estimate for $Q_K^{(2)}(\Psi^{(s+1)} \mid \Psi^{(s)})$.

Algorithm automation

The estimate of the variance of $\Psi^{(s+1)}$ given $\Psi^{(s)}$ can now be used to construct a
rule for updating the Monte Carlo sample size $K$. Consider the difference
$\|\Psi^{*(s+1)} - \Psi^{(s)}\|$, which is the difference between the true maximizer of $Q(\Psi \mid \Psi^{(s)})$ at the $(s+1)$th
iteration and the estimated value at the $s$th iteration. If this difference is small
relative to the Monte Carlo error associated with $\Psi^{(s+1)}$, then $\Psi^{(s+1)}$ will be of no
use for estimating $\Psi^{*(s+1)}$ as it will be overwhelmed with error. At such a point, one
would need to take a larger Monte Carlo sample $K$ to help reduce the error. Booth and
Hobert (1999) suggested calculating a confidence ellipsoid about $\Psi^{*(s+1)}$ and checking
if the previous estimate $\Psi^{(s)}$ was contained in the region. If $\Psi^{(s)}$ was contained in the
ellipsoid, they concluded that the current estimate $\Psi^{(s+1)}$ was swamped with Monte
Carlo error and that $K$ should be increased. Specifically,

\[
\text{if } (\Psi^{(s+1)} - \Psi^{(s)})' \, \widehat{\operatorname{var}}(\Psi^{(s+1)} \mid \Psi^{(s)})^{-1} (\Psi^{(s+1)} - \Psi^{(s)}) < \chi^2_{df, 1-\alpha}, \text{ then } K = K + K/\lambda, \tag{3.52}
\]

where $df$ denotes the dimension of $\Psi^{(s+1)}$. The most appropriate choices for $\lambda$ and
$\alpha$ are still under research; however, we have had success with $\lambda = 4$ and $\alpha = 0.25$.
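
The rule in (3.52) can be sketched in Python as follows. The function name update_sample_size is hypothetical, the integer rounding of the increment is an assumption, and the chi-squared critical value is passed in rather than computed. For df = 2 it has the closed form -2 log(alpha), which is used in the example.

```python
import math
import numpy as np

def update_sample_size(psi_new, psi_old, var_hat, K, crit, lam=4):
    """Increase K when the previous estimate lies inside the confidence ellipsoid."""
    d = np.asarray(psi_new, dtype=float) - np.asarray(psi_old, dtype=float)
    stat = float(d @ np.linalg.solve(var_hat, d))   # quadratic form in (3.52)
    if stat < crit:
        K = K + K // lam        # integer increment; the flooring is an assumption
    return K, stat

alpha = 0.25
crit = -2.0 * math.log(alpha)   # chi-squared quantile at 1 - alpha, valid for df = 2 only
var_hat = np.diag([0.01, 0.01]) # hypothetical Monte Carlo variance estimate

# Small change relative to the Monte Carlo error: the sample size grows.
K_small, stat_small = update_sample_size([1.05, 0.55], [1.0, 0.5], var_hat, 100, crit)
# Large change relative to the Monte Carlo error: the sample size is kept.
K_large, stat_large = update_sample_size([1.5, 1.0], [1.0, 0.5], var_hat, 100, crit)
```

For other dimensions the critical value would have to come from a chi-squared table or a statistical library.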

To determine convergence, we utilized a standard stopping rule. Denote the
dimension of $\Psi$ by $p^*$. Then the convergence criterion is satisfied if

\[
\max_{1 \le j \le p^*} \frac{|\Psi_j^{(s+1)} - \Psi_j^{(s)}|}{|\Psi_j^{(s)}| + 0.001} < 0.002. \tag{3.53}
\]

As  the  EM  algorithm  can  be  extremely  slow  to  converge  when  a parameter  is  near 
the  boundary  of  the  parameter  space  (e.g.,  when  a variance  component  is  near  zero), 
one  should  also  monitor  the  parameter  estimates  during  the  algorithm  to  ensure  this 
is  not  occurring.  We  programmed  the  Monte  Carlo  EM  algorithm  using  the  Ox 
programming  language. 
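
The stopping rule (3.53) is simple to state in code. The following Python sketch uses the constants 0.001 and 0.002 from (3.53); the function name and argument names are assumptions for illustration.

```python
def has_converged(psi_new, psi_old, delta1=0.001, delta2=0.002):
    """Relative-change stopping rule of (3.53) applied to all parameters."""
    largest = max(abs(new - old) / (abs(old) + delta1)
                  for new, old in zip(psi_new, psi_old))
    return largest < delta2
```

The constant added in the denominator keeps the criterion well defined for parameters near zero, such as a variance component close to the boundary. For such a parameter even a small absolute change is a large relative change, so the rule can take many iterations to be satisfied.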


3.4  Inference  and  Prediction 

We now consider inference for the fixed effects $\beta$ in the multivariate generalized
linear mixed model, as well as prediction of the random effects $u_i$, $i = 1, \ldots, n$. As the
estimates  obtained  from  the  quasi-Newton  algorithm  and  Monte  Carlo  EM  algorithm 
are  approximate  maximum  likelihood,  inference  concerning  the  fixed  effects  is  based 
on  the  usual  asymptotic  maximum  likelihood  theory.  Before  briefly  discussing  these 
approaches,  such  as  the  Wald  test  and  the  likelihood-ratio  test,  we  first  outline  the 
calculation  of  standard  error  estimates  for  the  algorithms  considered  in  the  previous 
section.  We  conclude  this  section  by  showing  how  adaptive  Gauss-Hermite  quadrature 
can  be  used  to  obtain  predictions  for  the  random  effects. 




3.4.1  Standard  Errors 

We now provide the necessary formulas for obtaining standard errors of $\hat{\Psi}$ upon
convergence of the algorithms in the previous section. We obtain these estimates by
inverting the observed information matrix, which requires calculation of the second
derivative of the marginal log-likelihood of the general multinomial random effects model
(i.e., the log of (3.19)) with respect to $\Psi$. Hedeker and Gibbons (1994) based
standard errors on the expected information matrix for their general ordinal random
effects model. For random effects models, the observed information matrix is often
easier to calculate. Efron and Hinkley (1978) argued that standard errors based on the
observed information matrix are also "closer" to the real data. However, unlike the
expected information matrix, the observed information matrix is not guaranteed to be
positive definite. Tutz and Hennevogl (1996) based standard errors on
the estimated information matrix $\sum_{i=1}^{n} s_i(\hat{\Psi}) s_i(\hat{\Psi})'$, where $s_i(\hat{\Psi})$ is the contribution
of the $i$th cluster to the approximated score function. They warned, however, that
this approach has a tendency of overestimating the true standard errors. In fact, we
will see in Section 3.6.1 that this approach can be extremely inaccurate. Denoting
the marginal log-likelihood by $l(\Psi)$, the matrix of second derivatives has the form

\[
\frac{\partial^2 l(\Psi)}{\partial \Psi \, \partial \Psi'} =
\begin{bmatrix}
\dfrac{\partial^2 l(\Psi)}{\partial \beta \, \partial \beta'} & \dfrac{\partial^2 l(\Psi)}{\partial \beta \, \partial \operatorname{vech}(\Sigma)'} \\
\dfrac{\partial^2 l(\Psi)}{\partial \operatorname{vech}(\Sigma) \, \partial \beta'} & \dfrac{\partial^2 l(\Psi)}{\partial \operatorname{vech}(\Sigma) \, \partial \operatorname{vech}(\Sigma)'}
\end{bmatrix}. \tag{3.54}
\]

As each component of (3.54) contains intractable integrals, adaptive Gauss-Hermite
quadrature or Monte Carlo integration is needed. For the EM algorithm, Louis (1982)
showed how the observed information matrix can be obtained using items already
calculated in the EM-steps. Thus, for the Monte Carlo EM algorithm we use Louis'
approach for obtaining standard errors, while for the quasi-Newton algorithm we
directly compute (3.54).




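We first consider direct approximation of (3.54) using adaptive Gauss-Hermite
quadrature. Denote the marginal log-likelihood $l(\Psi)$ as

\[
l(\Psi) = \sum_{i=1}^{n} \log L_i, \tag{3.55}
\]

where $L_i$ is given in (3.25). Interchanging integrals and derivatives, the elements of
(3.54) can be written in the form

\[
\sum_{i=1}^{n} \left[ \frac{B_i^{(e,f)}}{L_i} - \frac{C_i^{(e,f)} D_i^{(e,f)}}{L_i^2} \right], \tag{3.56}
\]

where $L = \prod_{i=1}^{n} L_i$. For the (1,1) element, that is $\dfrac{\partial^2 l(\Psi)}{\partial \beta \, \partial \beta'}$,

\[
B_i^{(1,1)} = \int \cdots \int f(y_i \mid \beta; u_i) \, g_{\mathrm{MVN}}(u_i; 0, \Sigma) \times \left[ \frac{\partial^2 \log f(y_i \mid \beta; u_i)}{\partial \beta \, \partial \beta'} + \frac{\partial \log f(y_i \mid \beta; u_i)}{\partial \beta} \frac{\partial \log f(y_i \mid \beta; u_i)}{\partial \beta'} \right] du_i, \tag{3.57}
\]

\[
C_i^{(1,1)} = \int \cdots \int f(y_i \mid \beta; u_i) \, g_{\mathrm{MVN}}(u_i; 0, \Sigma) \left[ \frac{\partial \log f(y_i \mid \beta; u_i)}{\partial \beta} \right] du_i, \tag{3.58}
\]

and $D_i^{(1,1)} = C_i^{(1,1)\prime}$.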


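For the (2,2) element, that is $\dfrac{\partial^2 l(\Psi)}{\partial \operatorname{vech}(\Sigma) \, \partial \operatorname{vech}(\Sigma)'}$,

\[
B_i^{(2,2)} = \int \cdots \int f(y_i \mid \beta; u_i) \, g_{\mathrm{MVN}}(u_i; 0, \Sigma) \left[ \frac{\partial^2 \log g_{\mathrm{MVN}}(u_i; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma) \, \partial \operatorname{vech}(\Sigma)'} + \frac{\partial \log g_{\mathrm{MVN}}(u_i; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma)} \frac{\partial \log g_{\mathrm{MVN}}(u_i; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma)'} \right] du_i, \tag{3.59}
\]

\[
C_i^{(2,2)} = \int \cdots \int f(y_i \mid \beta; u_i) \, g_{\mathrm{MVN}}(u_i; 0, \Sigma) \left[ \frac{\partial \log g_{\mathrm{MVN}}(u_i; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma)} \right] du_i, \tag{3.60}
\]

and $D_i^{(2,2)} = C_i^{(2,2)\prime}$.

Lastly, for the (1,2) element, that is $\dfrac{\partial^2 l(\Psi)}{\partial \beta \, \partial \operatorname{vech}(\Sigma)'}$,

\[
B_i^{(1,2)} = \int \cdots \int f(y_i \mid \beta; u_i) \, g_{\mathrm{MVN}}(u_i; 0, \Sigma) \left[ \frac{\partial \log f(y_i \mid \beta; u_i)}{\partial \beta} \frac{\partial \log g_{\mathrm{MVN}}(u_i; 0, \Sigma)}{\partial \operatorname{vech}(\Sigma)'} \right] du_i. \tag{3.61}
\]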

Thus for each element in (3.54) there are two sets of integrals to approximate.
The adaptive Gauss-Hermite approximation of the first set of integrals found in the
marginal likelihood $L$ has already been given in (3.29) and is calculated here using
the final parameter estimates $\hat{\Psi}$. Let $\varsigma^*_{\mathbf{l}} = (\varsigma^*_{1\mathbf{l}}, \ldots, \varsigma^*_{n\mathbf{l}})$, $l = 1, \ldots, K$, denote
the centered and scaled nodes used in approximating $L$, with corresponding curvatures
$(\Sigma_1^*, \ldots, \Sigma_n^*)$. We then approximate the second set of integrals by using these
same nodes along with the final parameter estimates. This parallels what is done
in Monte Carlo methods, where the random samples drawn at the final iteration
are used to evaluate the observed information matrix.
We have attempted to approximate each set of integrals individually. However, the
integrand of, for example, (3.57) is often ill-behaved, causing adaptive Gauss-Hermite
quadrature to perform poorly. We have already given approximation formulas for
some of the remaining terms in (3.56). The approximations for $\{C_i^{(1,1)}, D_i^{(1,1)}\}$
and $\{C_i^{(2,2)}, D_i^{(2,2)}\}$ can be obtained from slight modifications of (3.35) and
(3.38), respectively. Replicating the data vectors $y_{ij}$ and $x_{ij}$ $K^m$ times, the adaptive
quadrature approximation of $B_i^{(1,1)}$ is given by


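\[
\begin{aligned}
B_i^{(1,1)} \approx {} & -\sum_{\mathbf{l}}^{K} c^*_{i\mathbf{l}} \sum_{j=1}^{T_i} \left[ Z_{ij\mathbf{l}}' D_{ij\mathbf{l}} R_{\pi_{ij\mathbf{l}}}^{-1} D_{ij\mathbf{l}}' Z_{ij\mathbf{l}} - \sum_{r=1}^{q} Z_{ij\mathbf{l}}' U_{ij\mathbf{l}r} Z_{ij\mathbf{l}} (y_{ij\mathbf{l}r} - \pi_{ij\mathbf{l}r}) \right] \\
& + \sum_{\mathbf{l}}^{K} c^*_{i\mathbf{l}} \left[ \sum_{j=1}^{T_i} Z_{ij\mathbf{l}}' D_{ij\mathbf{l}} R_{\pi_{ij\mathbf{l}}}^{-1} (y_{ij\mathbf{l}} - \pi_{ij\mathbf{l}}) \right] \left[ \sum_{j=1}^{T_i} Z_{ij\mathbf{l}}' D_{ij\mathbf{l}} R_{\pi_{ij\mathbf{l}}}^{-1} (y_{ij\mathbf{l}} - \pi_{ij\mathbf{l}}) \right]',
\end{aligned} \tag{3.62}
\]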

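where summation is over $\mathbf{l} = (l_1, \ldots, l_m)$ and

\[
c^*_{i\mathbf{l}} = \left[ \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; \varsigma^*_{i\mathbf{l}}) \right] g_{\mathrm{MVN}}(\varsigma^*_{i\mathbf{l}}; 0, \Sigma) \, |\Sigma_i^*|^{1/2} \, 2^{m/2} \, w_{\mathbf{l}} \exp(\varsigma_{\mathbf{l}}' \varsigma_{\mathbf{l}}). \tag{3.63}
\]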

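Also, the approximation of $B_i^{(1,2)}$ is given by

\[
B_i^{(1,2)} \approx \sum_{\mathbf{l}}^{K} c^*_{i\mathbf{l}} \left[ \sum_{j=1}^{T_i} Z_{ij\mathbf{l}}' D_{ij\mathbf{l}} R_{\pi_{ij\mathbf{l}}}^{-1} (y_{ij\mathbf{l}} - \pi_{ij\mathbf{l}}) \right] \times \left[ -\Sigma^{-1} + \tfrac{1}{2} \operatorname{diag}(\Sigma^{-1}) + \Sigma^{-1} \varsigma^*_{i\mathbf{l}} \varsigma^{*\prime}_{i\mathbf{l}} \Sigma^{-1} - \tfrac{1}{2} \operatorname{diag}(\Sigma^{-1} \varsigma^*_{i\mathbf{l}} \varsigma^{*\prime}_{i\mathbf{l}} \Sigma^{-1}) \right], \tag{3.64}
\]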

with  c*,  defined  as  in  (3.63).  The  final  approximation  is  that  of  B,-2’2),  which  requires 

the  second  derivative  of  the  log  of  a multivariate  normal  density  with  respect  to  the 

unique  elements  of  E.  We  have  already  seen  the  general  form  of  this  derivative  in 

(3.51),  approximated  by  Monte  Carlo  methods.  The  form  that  is  needed  here  is 

slightly  different  than  (3.51),  however.  Let  Mu  denote  equation  (3.51)  with  the  jf 

and  Yli=i  removed,  summation  over  l changed  to  1,  and  ^ replaced  with  Then 

(2  2) 

the  approximation  of  B\  ’ ’ is  given  by 


r(2>2) 

K 

Y Cil  \ + 


-E-1  + i diag(E-')  + S'1  4 E-‘  - i diag(E-‘  <,  ^ E_I) 


-E-1  + 1 diag(E-1)  + E-‘  <*;  E-1  - 1 diagfE"1  E_1) 


(3.65) 




Using estimates (3.62), (3.64), and (3.65) along with the approximations for the
other elements of (3.56), one can obtain an estimate of the observed information
matrix (3.54) upon convergence. By inverting the negative of (3.54), the estimated
asymptotic variance-covariance matrix for $\hat{\psi}$ is obtained.

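In the context of the EM algorithm, Louis (1982) showed that the observed information
matrix could be calculated from

$$\frac{\partial^2 \log L}{\partial\psi\,\partial\psi'} = Q^{(2)}(\psi \mid \hat{\psi}) + \mathrm{var}\left\{ \frac{\partial \log f(y, u; \psi)}{\partial\psi} \,\Big|\, y; \hat{\psi} \right\}, \qquad (3.66)$$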


where the variance is with respect to $h(u \mid y)$. Upon convergence, the final Monte
Carlo sample, along with the final parameter estimates $\hat{\psi}$, can be used to estimate
(3.66). Note that the Monte Carlo estimate of the first term in (3.66) was previously
given in (3.48). The Monte Carlo approximation to the second term has the form


1 K 

var{log /(y,  u;  tf)  | y;4^}  « — ^ 

i=i 


^iog/(y,<?*;  4p 

cbJ> 


d^' 


1 Adlog/(y,^;4Q 

k ^ d<a 
1=1 


1 A rflog/(y,sf;4Q 

Kh  d *' 


(3.67) 


The  necessary  equations  for  the  derivatives  in  (3.67)  are  given  in  (3.46)  and  (3.47). 
Taking  the  negative  of  (3.66)  and  inverting  gives  the  desired  estimate  of  the  variance- 
covariance  matrix. 
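
As a rough illustration of how the two terms of (3.66) combine, the following Python
sketch forms the Monte Carlo variance of the complete-data score, as in (3.67), and adds
it to an averaged complete-data Hessian. The function name `louis_covariance`, its
argument names, and the use of NumPy are assumptions made for illustration only; the
per-draw score vectors and the averaged Hessian would have to be supplied from (3.46),
(3.47), and (3.48).

```python
import numpy as np


def louis_covariance(scores, mean_complete_hessian):
    """Sketch of a Louis-type covariance estimate from Monte Carlo output.

    scores: (K, p) array; row l holds the complete-data score evaluated at
        the l-th Monte Carlo draw and the final parameter estimates
        (hypothetical input).
    mean_complete_hessian: (p, p) array; Monte Carlo average of the
        complete-data Hessian, i.e. the first term of (3.66).
    """
    scores = np.asarray(scores, dtype=float)
    n_draws = scores.shape[0]

    # Second term of (3.66): average outer product minus outer product of averages.
    mean_score = scores.mean(axis=0)
    second_moment = scores.T @ scores / n_draws
    score_variance = second_moment - np.outer(mean_score, mean_score)

    # Hessian of the marginal log-likelihood, then negate and invert.
    observed_hessian = np.asarray(mean_complete_hessian, dtype=float) + score_variance
    covariance = np.linalg.inv(-observed_hessian)
    return covariance, score_variance
```

With a small Monte Carlo sample the variance term can be noisy, so the resulting matrix
is not guaranteed to be positive definite; checking its eigenvalues before reporting
standard errors seems prudent.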


3.4.2  Maximum  Likelihood  Inference 

Inference concerning the fixed effects parameters in the multinomial random effects
model is accomplished using standard asymptotic maximum likelihood theory.
Such an approach is justified as long as the approximation-based algorithms used are
accurately approximating the intractable integrals. In theory one can choose a large
enough Monte Carlo sample size or enough quadrature points to obtain the true maximum
likelihood estimates. We assume that such accuracy has been reached. A thorough
review of asymptotic maximum likelihood theory can be found in Prakasa Rao




(1987). We briefly review the theory for independent but not identically distributed
random variables.

Let $y = (y_1, \ldots, y_n)$ be independent but not identically distributed random
variables. Let $l_n(\beta)$, $s_n(\beta)$, $F_{E,n}(\beta)$, and $F_{O,n}(\beta)$ be the log-likelihood, score function,
and expected and observed information matrices for the entire sample $y$. Note that
these equations are given in Section 2.2. We consider local maximum likelihood
estimates $\hat{\beta}_n$ for $\beta$ in the interior of the parameter space. Though there is no guarantee
that a maximum of $l_n(\beta)$ will exist or that local and global maxima will coincide, for
many important models local and global maxima are identical and unique if they
exist (see, e.g., Kaufmann 1988 for the consideration of multicategorical models). The
standard $n^{1/2}$-asymptotics that we discuss hold under typical regularity conditions,
one such condition being that $F_{E,n}(\beta)/n$ converges to a positive definite limit:

$$F_{E,n}(\beta)/n = E[F_{O,n}(\beta)]/n \to F(\beta).$$

The following asymptotic results can be shown to hold under the regularity
assumptions. The score function $s_n(\beta)$ is asymptotically normal

$$n^{-1/2} s_n(\beta) \xrightarrow{d} N(0, F(\beta)). \qquad (3.68)$$

The maximum likelihood estimate $\hat{\beta}_n$ asymptotically exists and is asymptotically
consistent and normal

$$n^{1/2}(\hat{\beta}_n - \beta) \xrightarrow{d} N(0, F(\beta)^{-1}). \qquad (3.69)$$

Using these results, one can obtain asymptotic distributions for the likelihood-ratio,
Wald, and score statistics.

Consider testing the following linear hypothesis:

$$H_0: C\beta = \beta_0 \quad \text{versus} \quad H_a: C\beta \neq \beta_0,$$




where $C$ has full row rank $s < p$, the dimension of $\beta$. The likelihood-ratio statistic,

$$\lambda_{lr} = -2[l_n(\tilde{\beta}_n) - l_n(\hat{\beta}_n)], \qquad (3.70)$$

compares the likelihood value under the alternative hypothesis, where $\hat{\beta}_n$ is the
maximum likelihood estimate, to the likelihood value under the null hypothesis, where $\tilde{\beta}_n$
is the maximum likelihood estimate. The Wald statistic,

$$\lambda_w = (C\hat{\beta}_n - \beta_0)' [C F_{E,n}^{-1}(\hat{\beta}_n) C']^{-1} (C\hat{\beta}_n - \beta_0), \qquad (3.71)$$

compares the distance between the unrestricted estimate $C\hat{\beta}_n$ and its value under
the null hypothesis. The score statistic,

$$\lambda_s = s_n'(\tilde{\beta}_n) F_{E,n}^{-1}(\tilde{\beta}_n) s_n(\tilde{\beta}_n), \qquad (3.72)$$

compares the score function for the unrestricted model, evaluated at the null hypothesis
maximum likelihood estimate $\tilde{\beta}_n$, to zero. Asymptotic $\chi^2$ distributions of
the likelihood-ratio, Wald, and score statistics can be derived by expanding the
log-likelihood $l_n(\beta)$ in a Taylor series about $\hat{\beta}_n$ and using results (3.68) and (3.69).
Asymptotically the three tests are equivalent under the null and have the same limiting
$\chi^2_s$ distribution for the hypothesis given above.
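
To make the Wald statistic (3.71) concrete, the short Python sketch below evaluates
the quadratic form for a given estimate and estimated covariance matrix. The function
name `wald_statistic` and its arguments are hypothetical, and the covariance argument
is assumed to be the inverse of whichever information matrix (expected or observed)
is being used.

```python
import numpy as np


def wald_statistic(beta_hat, cov_beta, C, beta0):
    """Sketch of the Wald statistic for H0: C beta = beta0.

    cov_beta plays the role of the inverse information matrix evaluated at
    the unrestricted estimate (an assumption of this sketch).
    Returns the statistic and its degrees of freedom (number of rows of C).
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    C = np.atleast_2d(np.asarray(C, dtype=float))
    beta0 = np.atleast_1d(np.asarray(beta0, dtype=float))

    diff = C @ beta_hat - beta0                      # C beta_hat - beta0
    middle = C @ np.asarray(cov_beta, dtype=float) @ C.T
    statistic = float(diff @ np.linalg.solve(middle, diff))
    return statistic, C.shape[0]


# Illustrative call with made-up numbers: test a single coefficient against zero.
stat, df = wald_statistic([0.8, -0.3], [[0.04, 0.0], [0.0, 0.09]], [[1.0, 0.0]], [0.0])
```

For a single coefficient the statistic reduces to the squared ratio of the estimate to
its standard error, which is a convenient check on any implementation.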

For the multinomial random effects models, the observed information matrix
is easier to calculate than the expected information matrix. Since $F_{O,n}(\beta)/n$ and
$E[F_{O,n}(\beta)]/n$ converge to the same positive definite limit, the observed information
matrix can be inserted into (3.71) and (3.72). Only approximations are available for
the components in (3.70), (3.71), and (3.72), and we assume that they have been adequately
approximated. For testing of variance components, the asymptotics for the
Wald test and likelihood-ratio test break down (Self and Liang 1987; Bryk and Raudenbush
1992, p. 55), as the test involves the boundary of the parameter space. The




score  test,  however,  is  not  affected  by  such  conditions  (Chant  1974).  In  Chapter  5 we 
will  consider  two  approximate  score  tests  for  testing  individual  variance  components. 

3.4.3  Prediction 

Besides estimation of the fixed effects parameter vector $\beta$, one might also be
interested in predicting the values of the random effects vector $u_i$, $i = 1, \ldots, n$, or
linear combinations of fixed and random effects. Such predictions are based on the
conditional expectation of the random effects given the data and the final parameter
estimates. For the multinomial random effects model, this expectation is of the form

$$E[u_i \mid y_i; \hat{\psi}] = \frac{\int \cdots \int u_i\, f(y_i \mid u_i)\, g_{MVN}(u_i; 0, \Sigma)\, du_i}{\int \cdots \int f(y_i \mid u_i)\, g_{MVN}(u_i; 0, \Sigma)\, du_i}, \qquad (3.73)$$

where $y_i = (y_{i1}, \ldots, y_{iT_i})$. The expectation in (3.73) requires integral approximations,
thus adaptive Gauss-Hermite quadrature or Monte Carlo integration can be
employed. Though (3.73) involves only the data for subject $i$, the estimates of the $u_i$
“borrow” information from all of the subjects since $\hat{\psi}$ is obtained from the complete
data.
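
The ratio of integrals in (3.73) can be sketched numerically. The Python example below
is a deliberately simplified illustration, not the adaptive algorithm of Section 3.3: it
assumes a binary logit model with a single random intercept, uses ordinary
(non-adaptive) Gauss-Hermite quadrature from NumPy, and treats the fixed part of the
linear predictor and the standard deviation as known. The function name
`posterior_mean_intercept` and its arguments are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def posterior_mean_intercept(y, eta_fixed, sigma, n_nodes=20):
    """Sketch of E[u | y] for a binary logit model with random intercept u.

    y: 0/1 responses for one subject; eta_fixed: fixed part of the linear
    predictor for each response; sigma: standard deviation of u.
    All of these are assumed known here (illustrative simplification).
    """
    y = np.asarray(y)
    eta_fixed = np.asarray(eta_fixed, dtype=float)

    # Gauss-Hermite nodes and weights for integrals against exp(-z^2);
    # the substitution u = sqrt(2) * sigma * z handles the N(0, sigma^2) density.
    z, w = hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * z

    # Conditional likelihood f(y | u) evaluated at every node.
    eta = eta_fixed[:, None] + u[None, :]
    p = 1.0 / (1.0 + np.exp(-eta))
    lik = np.prod(np.where(y[:, None] == 1, p, 1.0 - p), axis=0)

    # Numerator and denominator of the ratio; the constant 1/sqrt(pi) cancels.
    numerator = np.sum(w * u * lik)
    denominator = np.sum(w * lik)
    return numerator / denominator
```

The predicted value shrinks toward zero as the assumed standard deviation decreases,
which reflects the "borrowing" of information described above.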

Booth and Hobert (1998) discussed the calculation of standard errors for predictions
involving random effects. For linear mixed models, standard errors of prediction
are typically based on $\mathrm{Var}(u_i \mid \hat{\psi}; y_i)$. Booth and Hobert (1998) showed that this
approach was inappropriate for mixed models with non-normal responses, and proposed
the use of the conditional mean squared error of prediction (CMSEP). The CMSEP
takes into account the variability associated with $\hat{\psi}$. A Taylor series approximation
to this correction factor can be found in Booth and Hobert (1998).

3.5  Pseudo-Likelihood  Estimation 

The estimation methods proposed in Section 3.3 utilized numerical integration
techniques to carry out maximum likelihood estimation. Though based on approximations,
estimates obtained from these methods can be considered “exact” maximum
likelihood, since, in theory, one can increase the Monte Carlo sample size or number
of quadrature points until a desired accuracy is reached. In contrast, we now
consider an approximate method for finding maximum likelihood estimates for the
multinomial random effects model. An advantage of the approximate method is that
it avoids the intractable integrals completely. Thus, estimation does not require the
computationally intensive numerical integration methods needed in Section 3.3. The
estimation routine is also attractive, as it parallels that used in standard linear mixed
models. We begin by reviewing some of the recent literature on approximate methods
for generalized linear mixed models. We then present the estimation routine for the
multinomial random effects model. We conclude this section with some discussion
concerning the proposed model.

3.5.1  Approximate  Inference  in  Generalized  Linear  Mixed  Models 

There have been a number of proposals for approximate maximum likelihood
estimation in generalized linear mixed models. Breslow and Clayton (1993) proposed a
penalized quasi-likelihood (PQL) approach for fitting generalized linear mixed models.
Under the assumption of normality for the random effects, they replaced the
integrated quasi-likelihood with a quadratic Laplace approximation in terms of the
current estimates of the random effects. Certain terms in the approximation were
assumed to vary slowly enough, as a function of the mean of the generalized linear
model, so that they could be ignored. The quasi-likelihood deviance terms in the
approximation were also replaced with estimated Pearson residuals. Breslow and
Clayton (1993) proposed a Fisher scoring algorithm for estimation of the fixed and
random effects and REML estimation for the variance component estimation. The
estimating equations corresponded to those obtained by Harville (1977) for best linear
unbiased estimation in the associated normal theory model. In addition, Breslow
and Clayton (1993) proposed a marginal quasi-likelihood (MQL) approach for modeling
the marginal mean in a generalized linear mixed model. The PQL and MQL
approaches correspond to the subject-specific and population-averaged approaches of
Zeger et al. (1988).

Engel and Keen (1994) proposed a method similar to the PQL approach.
Motivated by a quasi-likelihood, they utilized iteratively re-weighted least squares
and iterated MINQUE to estimate the fixed and random effects, and the variance
components. In contrast to Breslow and Clayton (1993), Engel and Keen (1994) allowed
for an additional overdispersion parameter. They used a method of moments
estimator for updating the overdispersion parameter in which they equated the Pearson
chi-square statistic to its degrees of freedom.

Wolfinger and O’Connell (1993) presented an approximate method for fitting
generalized linear mixed models using a pseudo-likelihood (PL) approach. A PL procedure
is based on the following concept. For parameters $\theta$ and $\beta$, PL estimates of $\theta$
are found by treating the parameters $\beta$ as known and equal to their current value,
and then estimating $\theta$ by maximum likelihood. Wolfinger and O’Connell (1993)
considered a model that allowed covariance structures for modeling both population-averaged
and subject-specific associations. In terms of the PL concept, the elements
of the covariance matrices corresponded to $\theta$ and the fixed and random effects
corresponded to $\beta$. In their approach, the fixed and random effects were estimated from
a linear mixed model based on an approximately normal pseudo-response variable.
Then the elements of the covariance matrices were updated using either maximum
likelihood or REML. Similarly to Engel and Keen (1994), they also allowed for an
additional overdispersion parameter. When the overdispersion parameter is forced to
be 1.0 and the population-averaged covariance matrix is ignored, the PQL approach
of Breslow and Clayton (1993) and Engel and Keen (1994), and the PL approach of
Wolfinger and O’Connell (1993) are equivalent.

Keen  and  Engel  (1997)  extended  the  methods  used  in  Engel  and  Keen  (1994) 
to  threshold  models  for  ordinal  responses.  The  threshold  models  were  motivated 




from  an  underlying  continuous  response  which  followed  a linear  mixed  model  (see 
Section  3.2.2).  They  assumed  that  the  residuals  for  the  underlying  linear  mixed 
model  were  normally  distributed,  resulting  in  a cumulative  probit  random  effects 
model.  Additionally,  they  considered  a more  general  class  of  link  functions  in  which 
the  distribution  of  the  residuals  was  allowed  to  depend  on  a set  of  additional  shape 
parameters.  Specifically,  they  assumed  that  the  residuals  followed  a t-distribution. 
Keen  and  Engel  (1997)  showed  that  improved  fits  for  particular  datasets  could  be 
just  as  easily  obtained  by  changing  the  link  function  for  the  cumulative  probabilities 
as  by  introducing  variability  in  the  thresholds. 

Though  attractive  for  their  computational  simplicity,  these  approximate  methods 
for  generalized  linear  mixed  models  have  been  shown  to  be  biased  for  Bernoulli  and 
binomial  response  data.  Breslow  and  Clayton  (1993)  reported  that  the  accuracy  of 
the regression coefficients improved as the binomial denominators increased. Breslow
and Lin (1995) and Lin and Breslow (1996) studied the asymptotic bias of the
variance  components  and  regression  coefficients,  and  proposed  bias  correction  factors 
for  adjusting  these  estimates.  Engel  (1998)  used  a simple  probit-normal  model  with 
two  Bernoulli  observations  per  cluster  and  the  overall  mean  as  the  only  fixed  effect 
to  show  the  severity  in  bias  that  can  occur  in  the  variance  component  estimation. 
In  general,  the  approximate  methods  such  as  PQL  will  perform  adequately  when  the 
binomial  sample  sizes  are  large  and  the  variance  components  are  small  to  moderate 
in  size  (Engel  and  Keen  1994). 

3.5.2  Pseudo-Likelihood  Estimation  for  Multinomial  Random  Effects  Models 

We  now  generalize  the  PL  estimation  approach  of  Wolfinger  and  O’Connell  (1993) 
to  multivariate  generalized  linear  mixed  models  for  nominal  and  ordinal  response  data. 
To  motivate  the  PL  algorithm,  we  consider  the  model  for  the  complete  response  vector 




$y = [y_{ij}]$,

$$y = \pi + e, \qquad (3.74)$$

with link function given by

$$g(\pi) = Z\beta + Wu,$$

where $Z = [Z_{ij}]$, $W = \mathrm{diag}(W_{ij})$, and $u = [u_i]$. It is assumed as before that $u_i$ is
multivariate normal with mean 0 and covariance $\Sigma$, $i = 1, \ldots, n$. Also, $e = [e_{ij}]$ is a
vector of unobserved errors with $E(e_{ij} \mid \pi_{ij}) = 0$ and $\mathrm{cov}(e_{ij} \mid \pi_{ij}) = R_{\pi_{ij}}^{1/2} R R_{\pi_{ij}}^{1/2}$.
For the multinomial models considered here, the form of $R_{\pi_{ij}}$ is given in Section
2.3. The additional unknown covariance matrix $R$ is included for the modeling of
population-average associations. For the complete data model, we define
$\mathrm{cov}(u) = \boldsymbol{\Sigma}$ and $\mathrm{cov}(e \mid \pi) = R_{\pi}^{1/2} \mathbf{R} R_{\pi}^{1/2}$, where $R_{\pi} = \mathrm{diag}(R_{\pi_{ij}})$, and $\boldsymbol{\Sigma}$ and $\mathbf{R}$ are
block-diagonal matrices with $\Sigma$ and $R$ on the diagonals, respectively.

The PL procedure is carried out by iteratively fitting a weighted Gaussian linear
mixed model to a modified response vector. To achieve this, a number of approximations
are required. First, for known estimates $\hat{\beta}$ and $\hat{u}$, let the estimated response
probabilities be

$$\hat{\pi} = h(\hat{\eta}) = h(Z\hat{\beta} + W\hat{u}),$$

where $h(\cdot)$ is the response function for the desired model. Then a Taylor series
approximation to the residuals $e = y - \pi$ from (3.74) about the current estimates $\hat{\beta}$
and $\hat{u}$ is given by

$$e \approx \tilde{e} = (y - \hat{\pi}) + (\eta - \hat{\eta})' \left. \frac{\partial (y - \pi)}{\partial \eta} \right|_{\hat{\eta}}
= (y - \hat{\pi}) - (Z\beta - Z\hat{\beta} + Wu - W\hat{u})' \hat{D}, \qquad (3.75)$$

where $\hat{D} = \mathrm{diag}(\hat{D}_{ij})$, and $D_{ij}$ is defined as in Chapter 2.




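Next, we approximate the conditional distribution of $\tilde{e} \mid \beta, u$ in (3.75) with a
Gaussian distribution that has the same first two moments as $e \mid \beta, u$. Thus we
assume that

$$\tilde{e} \mid \beta, u \sim MVN(0, R_{\pi}^{1/2} \mathbf{R} R_{\pi}^{1/2}). \qquad (3.76)$$

Then, using (3.75), (3.76), and approximating $\pi$ in $R_{\pi}^{1/2} \mathbf{R} R_{\pi}^{1/2}$ with $\hat{\pi}$, we obtain

$$\hat{D}^{-1\prime}(y - \hat{\pi}) \sim MVN[Z\beta - Z\hat{\beta} + Wu - W\hat{u},\ \hat{D}^{-1} R_{\hat{\pi}}^{1/2} \mathbf{R} R_{\hat{\pi}}^{1/2} \hat{D}^{-1\prime}]. \qquad (3.77)$$

It follows from (3.77) that the approximate “pseudo” observation vector,

$$\tilde{y} = g(\hat{\pi}) + \hat{D}^{-1\prime}(y - \hat{\pi}), \qquad (3.78)$$

has the approximate conditional distribution

$$\tilde{y} \mid \beta, u \sim MVN[Z\beta + Wu,\ \hat{D}^{-1} R_{\hat{\pi}}^{1/2} \mathbf{R} R_{\hat{\pi}}^{1/2} \hat{D}^{-1\prime}]. \qquad (3.79)$$

Treating $\beta$ as an unknown parameter, (3.79) is of the form of a weighted linear mixed
model with response $\tilde{y}$ and weight matrix given by $\mathcal{W} = \hat{D}' R_{\hat{\pi}}^{-1} \hat{D}$. Note that the
modified response $\tilde{y}$ in (3.78) is analogous to the pseudo-response defined in Section
2.3 for iteratively re-weighted least squares.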

The log-likelihood corresponding to (3.79) is easily obtained. Following Wolfinger
and O’Connell (1993), we insert an additional dispersion parameter $\phi$ into the
log-likelihood and re-express the covariance matrices $\boldsymbol{\Sigma}$ and $\mathbf{R}$ as $\boldsymbol{\Sigma}^* = \phi^{-1}\boldsymbol{\Sigma}$ and
$\mathbf{R}^* = \phi^{-1}\mathbf{R}$. The dispersion parameter is analogous to that used in quasi-likelihoods and
can be forced to be 1.0 if unneeded. The resulting log-likelihood is

$$l(\beta, \phi, \boldsymbol{\Sigma}^*, \mathbf{R}^*) = -\frac{1}{2} \log |\phi V| - \frac{1}{2\phi} (\tilde{y} - Z\beta)' V^{-1} (\tilde{y} - Z\beta) - \frac{n}{2} \log(2\pi), \qquad (3.80)$$




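where

$$V = \mathcal{W}^{-1/2} \mathbf{R}^* \mathcal{W}^{-1/2} + W \boldsymbol{\Sigma}^* W'. \qquad (3.81)$$

Closed form solutions for $\beta$ and $\phi$ that maximize (3.80) exist and are given by

$$\hat{\beta} = (Z' V^{-1} Z)^{-1} Z' V^{-1} \tilde{y}, \qquad (3.82)$$

and

$$\hat{\phi} = \frac{r' V^{-1} r}{n - p}, \qquad (3.83)$$

where $r = \tilde{y} - Z (Z' V^{-1} Z)^{-1} Z' V^{-1} \tilde{y}$. To obtain the estimates of $\boldsymbol{\Sigma}^*$ and $\mathbf{R}^*$ in
$V$, the restricted profile likelihood can be maximized

$$l_R(\boldsymbol{\Sigma}^*, \mathbf{R}^*) = -\frac{1}{2} \log |V| - \frac{n - p}{2} \log(r' V^{-1} r) - \frac{1}{2} \log |Z' V^{-1} Z|
- \frac{n - p}{2} \{1 + \log[2\pi/(n - p)]\}. \qquad (3.84)$$

Maximization of (3.84) provides REML estimates for $\boldsymbol{\Sigma}^*$ and $\mathbf{R}^*$. An estimate of $u$
can then be found from

$$\hat{u} = \hat{\boldsymbol{\Sigma}}^* W' \hat{V}^{-1} \hat{r}. \qquad (3.85)$$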

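Wolfinger and O’Connell (1993) referred to the PL algorithm that utilized the
REML estimation (3.84) as a restricted-PL (REPL) procedure. Using this terminology,
the REPL algorithm proceeds as follows:

0. Calculate initial estimates for $\beta^{(0)}$ and $u^{(0)}$. Set $\boldsymbol{\Sigma}^{(0)} = \mathbf{R}^{(0)} = I$.

For $s = 0, 1, \ldots$

1. Calculate the modified response $\tilde{y}^{(s)}$ using $\beta^{(s)}$ and $u^{(s)}$.

2. Maximize (3.84) to obtain $\hat{\boldsymbol{\Sigma}}^*$ and $\hat{\mathbf{R}}^*$. Convert $\hat{\boldsymbol{\Sigma}}^*$ and $\hat{\mathbf{R}}^*$ to $\boldsymbol{\Sigma}^{(s)}$ and $\mathbf{R}^{(s)}$
using $\hat{\phi}$ from (3.83).

3. If the change from $(\boldsymbol{\Sigma}^{(s-1)}, \mathbf{R}^{(s-1)})$ to $(\boldsymbol{\Sigma}^{(s)}, \mathbf{R}^{(s)})$ is small, then stop. Otherwise,
compute $\beta^{(s+1)}$ and $u^{(s+1)}$ from (3.82) and (3.85) and go to 1.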

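Upon convergence, an estimate of the approximate covariance matrix for $\hat{\beta}$ and $\hat{u}$
can be obtained by inverting

$$\begin{bmatrix}
Z' \mathcal{W}^{1/2} \mathbf{R}^{-1} \mathcal{W}^{1/2} Z & Z' \mathcal{W}^{1/2} \mathbf{R}^{-1} \mathcal{W}^{1/2} W \\
W' \mathcal{W}^{1/2} \mathbf{R}^{-1} \mathcal{W}^{1/2} Z & W' \mathcal{W}^{1/2} \mathbf{R}^{-1} \mathcal{W}^{1/2} W + \boldsymbol{\Sigma}^{-1}
\end{bmatrix}. \qquad (3.86)$$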
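
The closed-form pieces of one REPL iteration, namely (3.82), (3.83), and (3.85), can be
sketched in a few lines of Python. This is only an illustration under stated assumptions:
the matrix $V$ of (3.81) and the scaled random effects covariance are taken as already
available (the REML maximization of (3.84) is not shown), and the function name
`repl_closed_form_updates` and its argument names are hypothetical.

```python
import numpy as np


def repl_closed_form_updates(y_tilde, Z, W, Sigma_star, V):
    """Sketch of the closed-form updates (3.82), (3.83), and (3.85).

    y_tilde: pseudo-response vector of length n.
    Z: (n, p) fixed effects design matrix; W: (n, q) random effects design.
    Sigma_star: (q, q) scaled covariance of the stacked random effects.
    V: (n, n) marginal covariance of y_tilde up to the dispersion factor.
    """
    n, p = Z.shape

    # (3.82): generalized least squares estimate of beta.
    V_inv_Z = np.linalg.solve(V, Z)
    V_inv_y = np.linalg.solve(V, y_tilde)
    beta_hat = np.linalg.solve(Z.T @ V_inv_Z, Z.T @ V_inv_y)

    # Residual r = y_tilde - Z beta_hat, then (3.83) for the dispersion.
    r = y_tilde - Z @ beta_hat
    V_inv_r = np.linalg.solve(V, r)
    phi_hat = float(r @ V_inv_r) / (n - p)

    # (3.85): predicted random effects.
    u_hat = Sigma_star @ W.T @ V_inv_r
    return beta_hat, phi_hat, u_hat
```

Solving linear systems with `np.linalg.solve` rather than forming the inverse of $V$
explicitly is a common numerical choice; for large block-diagonal $V$ one would likely
exploit the block structure instead.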


3.5.3  Discussion 

In  the  motivation  of  the  algorithm  in  the  previous  section,  we  allowed  for  both  a 
subject-specific  covariance  matrix  and  a population-average  covariance  matrix.  For 
the models discussed in this thesis, only the subject-specific covariance term is
considered. Wolfinger and O’Connell (1993) discussed how one might use these matrices
individually,  but  did  not  elaborate  on  how  one  would  interpret  parameters  if  both 
were  used  in  the  same  model.  We  also  included  the  overdispersion  parameter  as 
proposed  by  Wolfinger  and  O’Connell  (1993).  To  allow  for  comparisons  between  the 
exact  maximum  likelihood  methods  and  the  proposed  approximate  method,  we  force 
the  overdispersion  parameter  to  be  1.0. 

We have noted before that the approximate maximum likelihood methods can
perform poorly for binomial responses with small sample sizes. We would suspect
that the proposed model will run into similar problems when the multinomial sample
size is small. As an example of this, Table 3.1 contains the estimates from both the
REPL algorithm and the adaptive Gauss-Hermite algorithm for a dataset taken from
Agresti and Lang (1993). The dataset originated from the 1989 General Social Survey
in which subjects were asked their opinion on (a) teens having sexual relations before
marriage, (b) a man and a woman having sexual relations before marriage, and (c)
a married person having sexual relations with someone other than their spouse. A
total of 475 subjects responded to each of the three questions using the response scale
“Always wrong”, “Almost always wrong”, “Wrong only sometimes”, and “Not wrong
at all”. The results in Table 3.1 were obtained by fitting a cumulative logit model
with a random intercept to account for the correlation between a given subject’s
responses. For the ith subject and the jth question, the linear predictor has the form

$$\eta_{ijr} = \alpha_r + \beta_1 x_{ij1} + \beta_2 x_{ij2} + u_i, \quad r = 1, \ldots, 3, \quad j = 1, \ldots, 3,$$

where $x_{ij1}$ and $x_{ij2}$ are one if the response pertains to teenage or premarital sex,
respectively, and zero otherwise. Thus, $\beta_1$ and $\beta_2$ are the corresponding regression
parameters, the $\{\alpha_r\}$ are the threshold parameters, and $u_i$ is assumed to be normally
distributed with mean zero and standard deviation $\sigma$. We include this example to
point out the disparity in the estimates of the standard deviation of the random
effect. The estimate from the REPL algorithm is substantially smaller than that of
the adaptive quadrature algorithm. The fixed effects parameter estimates and standard
errors differ as well, though statistical conclusions from the two models would
be the same. Note that the log-likelihood values for the two models are not comparable
and that, as can be seen from the approximate covariance matrix (3.86), the REPL
algorithm does not provide standard error estimates for the variance components.

As  has  been  done  for  the  binary  case,  further  research  is  needed  to  investigate  the 
accuracy  of  the  REPL  approach  for  multinomial  models.  Bias  correction  terms,  such 
as  those  proposed  by  Lin  and  Breslow  (1996),  could  also  be  examined.  At  the  very 
least,  the  approximate  methods  considered  here  can  be  used  to  calculate  starting 
values for the exact algorithms considered previously. Their speed and simplicity
also make them ideal for doing exploratory analyses prior to fitting a full “exact”
maximum likelihood analysis.


3.6  Applications 

We now consider three examples to illustrate the fitting methods discussed in Sections
3.3 and 3.5. Our intent is not to provide a thorough analysis of each dataset,




Table 3.1: Parameter estimates for fitting a cumulative logit model with a random
intercept to the sexual opinion dataset of Agresti and Lang (1993). Results are shown for
the 10-point adaptive Gauss-Hermite algorithm (AGH(10)) and the restricted
pseudo-likelihood algorithm (REPL).

Parameter      AGH(10)               REPL
$\alpha_1$        2.652                 1.922
$\alpha_2$        3.786                 2.890
$\alpha_3$        5.435                 4.267
$\beta_1$        -0.571 (.187)         -0.455 (.172)
$\beta_2$        -4.378 (.262)         -3.340 (.169)
$\sigma$          2.267 (0.191)         1.592
LL            -1218.429             -7822.265

but to simply apply the methods to some specific models. The data in Table 3.2
are from a wine tasting experiment (Randall 1989) in which wine preferences were
measured on an ordinal scale. To account for possible heterogeneity of judges, Tutz
and Hennevogl (1996) fit a cumulative logit random intercept model utilizing their
EM Gauss-Hermite and EM Monte Carlo algorithms. We fit the same model with
our algorithm to allow for comparisons between the approaches. We also illustrate
our methods using an adjacent-category logit random intercept model. The second
dataset in Table 3.3 arose from a developmental toxicity study using litters of mice
conducted under the U.S. National Toxicology Program (Price et al. 1985). The three
possible outcomes in the study (Dead/Resorption, Malformation, Normal) have a natural
sequential ordering which lends itself to a continuation-ratio logit model. We
incorporate random effects to account for the correlations between fetuses in the same
litter. The final dataset, Table 3.4, is from the 1975 U.S. General Household Survey
in which subjects indicated their degree of satisfaction with family (F), hobbies (H),




Table 3.2: Bitterness of wine data (Randall 1989) classified by temperature, presence
or absence of skin contact, and bottle number.

                   Low Temperature                         High Temperature
             No Contact          Contact             No Contact          Contact
Judge    Bottle 1  Bottle 2  Bottle 1  Bottle 2  Bottle 1  Bottle 2  Bottle 1  Bottle 2
  1         2         3         3         4         4         4         5         5
  2         1         2         1         3         2         3         5         4
  3         2         3         3         2         5         5         4         4
  4         3         2         3         2         3         2         5         3
  5         2         3         4         3         3         3         3         3
  6         3         2         3         2         2         4         5         4
  7         1         1         2         2         2         3         2         3
  8         2         2         2         3         3         3         3         4
  9         1         2         3         2         3         2         4         4

and residence (R) (Clogg 1979). Such item response data are common in the psychometric
literature. Hedeker (2000) utilized a baseline-category logit model with random
effects to analyze Table 3.4. To allow for comparisons, we fit similar random effects
models. We note that all results and computing times reported were obtained on a
Sun Enterprise 450 computer which had four 400 MHz processors and one gigabyte
of RAM.

3.6.1  Wine  Tasting  Experiment 

Table  3.2  contains  the  data  from  a study  on  the  bitterness  of  white  wine  (Randall 
1989).  Of  interest  in  the  study  was  whether  certain  factors  that  can  be  controlled 
during  the  pressing  of  the  grapes  influenced  the  bitterness  of  the  wine.  The  factors 
considered  were  the  temperature  during  pressing  and  whether  there  was  contact  of 
the  juice  with  the  skin  when  the  grapes  were  crushed.  Temperature  was  considered  as 
either  high  or  low  and  contact  was  measured  by  presence  or  absence.  At  each  of  the 
temperature/contact  combinations,  two  bottles  of  white  wine  were  randomly  chosen 
and  the  bitterness  of  each  was  classified  on  a five-point  ordinal  scale  from  least  to 
most  bitter.  For  this  factorial  experiment,  nine  professional  judges  were  chosen  to 




Table 3.3: Developmental toxicity data (Price et al. 1985) classified by ethylene glycol
dosage and frequency of fetus outcome (D/R = Dead/Resorption, M = Malformation,
N = Normal).

                      Ethylene Glycol Dosage (g/kg)
              0.00            0.75            1.50            3.00
Litter    D/R   M    N    D/R   M    N    D/R   M    N    D/R   M    N
   1       1    0    7     0    3    7     0    8    2     0    4    3
   2       0    0   14     1    3   11     0    6    5     1    9    1
   3       0    0   13     0    2    9     0    5    7     0    4    8
   4       0    0   10     0    0   12     0   11    2     1   11    0
   5       0    1   15     0    1   11     1    6    3     0    7    3
   6       1    0   14     0    3   10     0    7    6     0    9    1
   7       1    0   10     0    0   15     0    0    1     0    3    1
   8       0    0   12     0    0   11     0    3    8     0    7    0
   9       0    0   11     2    0    8     0    8    3     0    1    3
  10       0    0    8     0    1   10     0    2   12     0   12    0
  11       1    0    6     0    0   10     0    1   12     2   12    0
  12       0    0   15     0    1   13     0   10    5     0   11    3
  13       0    0   12     0    1    9     0    5    6     0    5    6
  14       0    0   12     0    0   14     0    1   11     0    4    8
  15       0    0   13     1    1   11     0    3   10     0    5    7
  16       0    0   10     0    1    9     0    0   13     2    3    9
  17       0    0   10     0    1   10     0    6    1     0    9    1
  18       1    0   11     0    3   10     0    2    6     0    0    9
  19       0    0   12     0    0   15     0    1    2     0    5    4
  20       0    0   13     0    0   15     0    0    7     0    2    5
  21       1    0   14     0    2    5     0    4    6     1    3    9
  22       0    0   13     0    1   11     0    0   12     0    2    5
  23       0    0   13     0    1    6                     0    1   11
  24       1    0   14     1    1    8
  25       0    0   14



Table 3.4: 1975 U.S. General Household Survey data (Clogg 1979) concerning degree
of satisfaction with family (F), hobbies (H), and residence (R) on a three-point scale
(1=Low, 2=Medium, 3=High).

 Response                  Response                  Response
 Profile                   Profile                   Profile
 F  H  R   Frequency       F  H  R   Frequency       F  H  R   Frequency
 1  1  1      15           1  2  1       3           1  3  1       5
 1  1  2      11           1  2  2      12           1  3  2      14
 1  1  3       7           1  2  3       5           1  3  3      16
 2  1  1      16           2  2  1      23           2  3  1      18
 2  1  2      26           2  2  2      58           2  3  2      38
 2  1  3      12           2  2  3      31           2  3  3      27
 3  1  1      23           3  2  1      45           3  3  1      64
 3  1  2      49           3  2  2     117           3  3  2     191
 3  1  3      54           3  2  3     126           3  3  3     466

rate  each  of  the  eight  bottles  of  wine.  Since  the  judges  cannot  be  expected  to  have  the 
same  sensitivity  to  the  bitterness  of  wine,  one  would  expect  their  individual  ratings 
to  be  correlated.  An  analysis  of  this  data  should  account  for  the  heterogeneity  of 
judges. 

To account for the heterogeneity, Tutz and Hennevogl (1996) fit a cumulative logit
random effects model that allowed the judges to have shifted thresholds. They fit a
random intercept model that included factors for temperature, contact, and bottle.
That is, for the jth evaluation by the ith judge,

$$\eta_{ijr} = \alpha_r + \beta_{TE}\, x_{ij1} + \beta_{CO}\, x_{ij2} + \beta_{BO}\, x_{ij3} + u_i, \qquad (3.87)$$

$$r = 1, \ldots, R - 1, \quad j = 1, \ldots, T, \quad i = 1, \ldots, n,$$

where $R = 5$, $T = 8$, and $n = 9$. In (3.87), $\beta_{TE}$, $\beta_{CO}$, and $\beta_{BO}$ are the parameter
coefficients for the temperature, contact, and bottle factors, respectively. Tutz
and Hennevogl (1996) utilized effect coding (effects sum to zero) for the covariates.
Following their coding scheme, $x_{ij1}$ = temperature was coded as (Low=1, High=-1),




Table 3.5: Parameter estimates and log-likelihood values (LL) for fitting model (3.87)
to the wine tasting dataset using the cumulative logit link. Numbers in column labels
denote the number of quadrature points or Monte Carlo samples used in the adaptive
Gauss-Hermite (AGH), Gauss-Hermite EM (GHEM) (Tutz and Hennevogl 1996),
and Monte Carlo EM (MCEM) (Tutz and Hennevogl 1996) algorithms. MCEM^
refers to the automated Monte Carlo EM algorithm, REPL refers to the restricted
pseudo-likelihood algorithm, and FIXED refers to the fixed effects model obtained
by omitting the random effect. Standard errors are given in parentheses.

                FIXED     AGH(5)    MCEM^     REPL      GHEM(10)  MCEM(20)
$\alpha_1$      -3.359    -4.082    -4.057    -3.993    -4.139    -4.439
$\alpha_2$      -0.762    -0.930    -0.923    -0.911    -0.969    -1.234
$\alpha_3$       1.456     1.797     1.787     1.755     1.777     1.519
$\alpha_4$       2.994     3.657     3.638     3.585     3.649     3.368
$\beta_{TE}$     1.251     1.536     1.527     1.501     1.546     1.549
                (.264)    (.298)    (.295)    (.287)    (.437)    (1.092)
$\beta_{CO}$     0.763     0.916     0.911     0.894     0.925     0.925
                (.238)    (.256)    (.257)    (.248)    (.347)    (.825)
$\beta_{BO}$     0.048     0.122     0.120     0.120     0.123     0.126
                (.223)    (.232)    (.236)    (.228)    (.320)    (.753)
$\sigma$           -       1.145     1.105     1.213     1.243     1.261
                          (.401)    (.397)      -       (.479)    (.954)
LL             -86.469   -81.394       -         -     -81.365   -80.437

$x_{ij2}$ = contact was coded as (No Contact = 1, Contact = -1), and $x_{ij3}$ = bottle was
coded as (Bottle 1 = 1, Bottle 2 = -1).

In  Table  3.5  are  the  results  of  fitting  model  (3.87)  with  the  adaptive  Gauss- 
Hermite  quadrature,  Monte  Carlo,  and  REPL  algorithms  given  in  Sections  3.3  and 
3.5.  Included  in  Table  3.5  are  the  results  reported  by  Tutz  and  Hennevogl  (1996)  as 
well  as  the  estimates  obtained  from  fitting  the  fixed  effects  model.  To  determine  the 
number  of  quadrature  points  for  the  adaptive  Gauss-Hermite  algorithm,  the  number 
of  nodes  was  successively  increased  until  the  difference  in  parameter  and  standard 
error  estimates  between  successive  fits  was  less  than  0.0001.  Five  quadrature  nodes 
were found to be sufficient to obtain the desired accuracy. The algorithm required
less than 30 seconds to obtain the final parameter estimates using five quadrature
nodes. The automated MCEM algorithm, initialized at the final REPL estimates,
took approximately 23 hours to converge, with the simulation size growing from
100 to 99,365. As noted before, the MCEM algorithm is useful for models
with  high  numbers  of  random  effects  but  is  generally  inefficient  for  low  to  moderate 
sized  models  when  compared  with  adaptive  Gauss-Hermite  quadrature.  The  REPL 
algorithm  required  less  than  30  seconds  to  obtain  convergence.  Due  to  the  large 
matrices  (the  design  matrix  Z for  the  entire  data  has  dimensions  288  by  7),  it  was 
necessary  to  fit  the  model  by  looping  over  clusters  instead  of  using  the  entire  data 
directly  as  in  the  definition  of  the  algorithm  in  Section  3.5.  The  final  two  columns  in 
Table  3.5  contain  the  estimates  reported  by  Tutz  and  Hennevogl  (1996)  using  their 
EM  algorithms. 
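The node-selection rule described above (add quadrature nodes until successive fits agree) can be illustrated on a single cluster. The following is a minimal sketch only, assuming a random-intercept logistic model for one cluster of binary responses rather than the multinomial models fitted here. The function names, the data `y`, the linear predictor `eta`, and the stopping rule based on the cluster's log marginal likelihood (instead of parameter and standard error estimates) are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def log_integrand(u, y, eta, sigma):
    """log of f(y | u) * exp(-u^2 / (2 sigma^2)) for one cluster of binary responses."""
    lin = eta + u
    return np.sum(y * lin - np.log1p(np.exp(lin))) - 0.5 * u**2 / sigma**2


def agh_marginal(y, eta, sigma, n_nodes):
    """Adaptive Gauss-Hermite approximation to the cluster's marginal likelihood."""
    # Newton iterations for the mode of the (log-concave) integrand.
    u_hat = 0.0
    for _ in range(100):
        p = 1.0 / (1.0 + np.exp(-(eta + u_hat)))
        grad = np.sum(y - p) - u_hat / sigma**2
        hess = -np.sum(p * (1.0 - p)) - 1.0 / sigma**2
        step = grad / hess
        u_hat -= step
        if abs(step) < 1e-10:
            break
    # Curvature at the mode gives the scale used to recentre and rescale the nodes.
    p = 1.0 / (1.0 + np.exp(-(eta + u_hat)))
    scale = 1.0 / np.sqrt(np.sum(p * (1.0 - p)) + 1.0 / sigma**2)
    x, w = hermgauss(n_nodes)  # rule for the weight function exp(-x^2)
    u = u_hat + np.sqrt(2.0) * scale * x
    g = np.array([np.exp(log_integrand(ui, y, eta, sigma)) for ui in u])
    integral = np.sqrt(2.0) * scale * np.sum(w * np.exp(x**2) * g)
    return integral / (np.sqrt(2.0 * np.pi) * sigma)


def nodes_needed(y, eta, sigma, tol=1e-4, max_nodes=40):
    """Increase the number of nodes until successive log-likelihoods differ by < tol."""
    prev = agh_marginal(y, eta, sigma, 1)
    for q in range(2, max_nodes + 1):
        cur = agh_marginal(y, eta, sigma, q)
        if abs(np.log(cur) - np.log(prev)) < tol:
            return q, cur
        prev = cur
    return max_nodes, prev


# Hypothetical cluster: eight binary responses with a common linear predictor.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
eta = np.full(8, 0.3)
q, value = nodes_needed(y, eta, sigma=1.145)
print(q, value)
```

Because the nodes are recentred at the mode of the integrand and rescaled by its curvature, few nodes are typically needed, which is in line with the small node counts reported for the adaptive algorithm in this section.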

Examining  the  AGH(5)  column  in  Table  3.5,  it  is  clear  that  both  temperature 
and  contact  impact  the  perceived  bitterness  of  the  wine.  The  Wald  statistics  for 
these  two  factors  are  26.5  and  12.8,  respectively,  which  are  highly  significant  (P  < 
0.001).  The  positive  sign  of  both  the  temperature  and  contact  coefficients  indicates 
that  the  lower  temperature  and  no  contact  levels  are  associated  with  lower  perceived 
bitterness. For example, holding bottle and contact fixed, the odds of bitterness being
below any fixed level are exp(2 * 1.536) = 21.6 times greater for low temperature than
for high temperature for a given subject. The estimated standard deviation of the
random  effect  is  1.15,  indicating  that  the  judges  did  indeed  vary  with  respect  to  their 
perceived  bitterness.  Though  similar  inferential  results  are  obtained  from  both  the 
fixed  and  random  effects  models,  the  parameter  estimates  and  standard  errors  in  the 
random  effects  model  are  correctly  adjusted  for  the  unobserved  heterogeneity  of  the 
judges. 
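The Wald statistics and the odds ratio quoted above can be reproduced from the AGH(5) column of Table 3.5. The sketch below recomputes them from the rounded table entries, so small differences from the values in the text (which were presumably computed from unrounded estimates) are to be expected. The variable names are illustrative.

```python
import math

# (estimate, standard error) from the AGH(5) column of Table 3.5
estimates = {"temperature": (1.536, 0.298), "contact": (0.916, 0.256)}

# Wald statistic (estimate / SE)^2, referred to a chi-square with 1 df.
wald = {name: (b / se) ** 2 for name, (b, se) in estimates.items()}

# For 1 df, P(chi-square > w) = erfc(sqrt(w / 2)).
p_values = {name: math.erfc(math.sqrt(w / 2.0)) for name, w in wald.items()}

# Effect coding is (Low = 1, High = -1), so low and high temperature differ by
# two units of the covariate: cumulative odds ratio = exp(2 * beta_TE).
odds_ratio_temperature = math.exp(2 * estimates["temperature"][0])

for name in estimates:
    print(name, round(wald[name], 1), p_values[name])
print(round(odds_ratio_temperature, 1))
```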

The results of the five different algorithms for fitting the random effects model are
generally in agreement. The automated MCEM and REPL results are very similar to
the adaptive quadrature results, with the latter approach providing a slightly larger
estimated standard deviation for the random effect. The parameter estimates from
the  EM  algorithms  of  Tutz  and  Hennevogl  (1996)  are  generally  larger  than  those 
obtained  using  adaptive  quadrature.  In  addition,  the  standard  errors  are  much  larger 
than  those  estimated  by  the  AGH,  automated  MCEM,  and  REPL  methods.  Recall 
that  the  approach  for  finding  standard  errors  used  by  Tutz  and  Hennevogl  (1996)  has 
a tendency  to  overestimate  the  true  standard  errors.  However,  the  extremely  large 
standard  errors  for  their  MCEM  algorithm  (nearly  three  times  those  of  the  adaptive 
quadrature  method)  are  due  in  part  to  an  error  in  their  programming  (personal 
communication,  Tutz).  Though  not  reported  in  Table  3.5,  we  also  fit  model  (3.87) 
using  the  direct,  Gauss-Hermite  maximization  approach  of  Hedeker  and  Gibbons 
(1994).  Using  their  approach,  we  needed  25  quadrature  points  to  obtain  the  same 
results  that  were  obtained  using  five  point  adaptive  quadrature. 

For this particular dataset, one might also be interested in predicting the individual
judge effects. Such predictions can be obtained using the methods described
in Section 3.4.3 or through the approximate REPL procedure. Table 3.6 contains
the predicted judge effects using adaptive quadrature, the REPL procedure, and the
estimates reported by Tutz and Hennevogl (1996) using Gauss-Hermite quadrature.
We also include the estimates obtained by treating judge as a fixed effect. It is clear
from the fixed effect estimates that the judges have varying scales of bitterness, with
judges 1 (-2.42) and 7 (2.83) being quite different from each other. The predicted
estimates obtained from adaptive quadrature maintain the same ordering as in the
fixed effects case, but are smaller in magnitude. For example, the thresholds for judge
7 are predicted to be shifted by 1.93 from the overall thresholds reported in Table 3.5.
Such smoothing of the estimates is common as the random effect approach borrows
information across all judges to obtain the predictions. The approximate REPL approach
provides very similar predictions for the random effects. We again see that the




Table 3.6: Predicted judge effects from the cumulative logit model (3.87) with standard
errors. Numbers in column labels denote the number of quadrature points used
in the adaptive Gauss-Hermite (AGH) or Gauss-Hermite EM (GHEM) (Tutz and
Hennevogl 1996) algorithms. REPL refers to the restricted pseudo-likelihood algorithm.
FIXED refers to the fixed effects model obtained by omitting the random
effect.

Judge    FIXED            AGH(5)           REPL             GHEM(10)
  1     -2.423 (.716)    -1.717 (.759)    -1.731 (.704)    -1.853 (.505)
  2      0.905 (.666)     0.598 (.710)     0.608 (.694)     0.650 (.563)
  3     -1.480 (.670)    -0.992 (.725)    -1.007 (.694)    -0.952 (.622)
  4      0.078 (.659)     0.052 (.693)     0.055 (.692)     0.101 (.686)
  5     -0.357 (.659)    -0.234 (.687)    -0.236 (.691)    -0.261 (.648)
  6     -0.713 (.660)    -0.473 (.689)    -0.478 (.692)    -0.498 (.560)
  7      2.830 (.769)     1.929 (.816)     1.954 (.719)     2.021 (.494)
  8      0.380 (.660)     0.273 (.653)     0.276 (.692)     0.395 (.567)
  9      0.780 (.673)     0.552 (.672)     0.559 (.693)     0.621 (.484)

GHEM  algorithm  of  Tutz  and  Hennevogl  (1996)  provides  generally  larger  estimates 
than  those  obtained  using  adaptive  quadrature.  Standard  errors  for  the  adaptive 
quadrature  procedure  are  based  on  an  approximation  to  conditional  mean  square 
error  of  prediction  described  by  Booth  and  Hobert  (1998).  Though  they  did  not 
report  how  they  calculated  the  standard  errors,  it  appears  that  Tutz  and  Hennevogl 
(1996)  did  not  adjust  for  the  variability  associated  with  using  estimates  for  the  model 
parameters,  which  resulted  in  smaller  standard  error  estimates. 
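The smoothing of the predicted judge effects relative to the fixed effect estimates can be checked directly from Table 3.6. The short sketch below uses only the FIXED and AGH(5) columns of that table and computes the ratio of the predicted random effect to the corresponding fixed effect estimate. The variable names are illustrative.

```python
import numpy as np

# FIXED and AGH(5) columns of Table 3.6, judges 1 through 9
fixed = np.array([-2.423, 0.905, -1.480, 0.078, -0.357, -0.713, 2.830, 0.380, 0.780])
agh = np.array([-1.717, 0.598, -0.992, 0.052, -0.234, -0.473, 1.929, 0.273, 0.552])

# Ratio of predicted random effect to fixed effect estimate for each judge.
shrinkage = agh / fixed

same_ordering = bool(np.array_equal(np.argsort(fixed), np.argsort(agh)))
all_smaller = bool(np.all(np.abs(agh) < np.abs(fixed)))

print(np.round(shrinkage, 2), same_ordering, all_smaller)
```

Every ratio lies below one, so each prediction is pulled toward zero while the ordering of the judges is unchanged, which is the pattern described in the text.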

Instead  of  using  a cumulative  logit  link  for  modeling  Table  3.2,  one  could  also 
utilize  the  adjacent-category  link.  The  effects  in  the  cumulative  logit  model  refer  to 
the  entire  response  scale.  In  contrast,  the  effects  in  the  adjacent-category  logit  model 
refer  to  the  multiplicative  effect  of  a one-unit  increase  of  a predictor  on  the  odds  of 
response  in  the  higher  instead  of  the  lower  of  any  two  adjacent  categories.  Generally 
both  models  will  fit  well  in  similar  situations  as  they  both  imply  stochastic  orderings 
of  the  response  distributions  for  different  predictor  values.  Thus  the  choice  of  one 
link  over  another  will  depend  on  the  desired  interpretation. 
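The difference in interpretation between the two links can be made concrete by converting linear predictors into category probabilities. The sketch below is illustrative only. It assumes the parameterizations suggested by the interpretations given in this section, namely logit P(Y <= r) = alpha_r + eta for the cumulative link and log(P(Y = r+1) / P(Y = r)) = alpha_r + eta for the adjacent-category link (the formal definitions are in Section 2.4.2, which is not reproduced here). It uses the AGH estimates from Tables 3.5 and 3.7 with the judge effect set to zero, no contact, and bottle 1.

```python
import numpy as np


def cumulative_logit_probs(alpha, eta):
    """Category probabilities when logit P(Y <= r) = alpha_r + eta."""
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(alpha) + eta)))
    cum = np.append(cum, 1.0)
    return np.diff(cum, prepend=0.0)


def adjacent_category_probs(alpha, eta):
    """Category probabilities when log(P(Y=r+1)/P(Y=r)) = alpha_r + eta."""
    logits = np.asarray(alpha) + eta
    log_unnorm = np.concatenate(([0.0], np.cumsum(logits)))
    p = np.exp(log_unnorm - log_unnorm.max())
    return p / p.sum()


# AGH(5) estimates, Table 3.5 (cumulative logit link)
ALPHA_CUM = np.array([-4.082, -0.930, 1.797, 3.657])
B_CUM = {"te": 1.536, "co": 0.916, "bo": 0.122}
# AGH(8) estimates, Table 3.7 (adjacent-category logit link)
ALPHA_ADJ = np.array([-0.009, -1.043, -1.880, -2.250])
B_ADJ = {"te": -1.149, "co": -0.659, "bo": -0.056}


def eta(b, temperature):
    # Effect coding: temperature is +1 (low) or -1 (high); contact and bottle fixed at +1.
    return b["te"] * temperature + b["co"] + b["bo"]


p_low_cum = cumulative_logit_probs(ALPHA_CUM, eta(B_CUM, +1))
p_high_cum = cumulative_logit_probs(ALPHA_CUM, eta(B_CUM, -1))
p_low_adj = adjacent_category_probs(ALPHA_ADJ, eta(B_ADJ, +1))
p_high_adj = adjacent_category_probs(ALPHA_ADJ, eta(B_ADJ, -1))

print(np.round(p_low_cum, 3), np.round(p_high_cum, 3))
print(np.round(p_low_adj, 3), np.round(p_high_adj, 3))
```

Under the cumulative link the odds ratio exp(2 * 1.536) applies to every cumulative split of the scale, whereas under the adjacent-category link exp(2 * (-1.149)) applies to each pair of neighbouring categories.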




Using  the  appropriate  design  matrix  and  link  function  from  Section  2.4.2,  we  fit 
model  (3.87)  using  the  adjacent-category  logit  link.  Results  are  shown  in  Table  3.7 
for  the  adaptive  quadrature  and  REPL  algorithms.  For  the  adaptive  quadrature 
algorithm,  eight  quadrature  points  were  required  to  obtain  the  desired  accuracy.  The 
inferential  results  are  substantively  the  same  for  the  adjacent-category  logit  model  as 
for  the  cumulative  logit  model.  We  again  see  that  both  temperature  and  contact  are 
associated with the perceived bitterness. Holding all other factors constant and for
a given subject, the odds for the low temperature that the bitterness value is 2 instead
of 1 (or 3 instead of 2, or 4 instead of 3, or 5 instead of 4) are estimated to be
exp(2 * (-1.149)) = 0.10 times the odds for the high temperature. To parallel the
cumulative logit results, the odds of 1 instead of 2 are exp(2 * 1.149) = 9.95 times
greater for the low than the high temperature. The estimated odds are greater for
the  cumulative  logit  model  (e.g.  21.6  for  temperature)  as  that  link  refers  to  the  entire 
response  scale.  The  estimated  standard  deviation  for  the  random  effect  is  0.84  for  the 
adaptive  quadrature  procedure  with  similar  results  for  the  REPL  procedure.  Again 
we  see  that  effects  and  standard  errors  are  larger  than  the  corresponding  fixed  effects 
estimates  when  the  heterogeneity  of  the  judges  is  taken  into  account. 

3.6.2  Developmental  Toxicity  Data 

Table  3.3  displays  the  results  of  a toxicity  study  involving  ethylene  glycol.  Such 
experiments  are  used  to  test  and  regulate  substances  which  may  pose  potential  danger 
to  developing  fetuses.  In  this  experiment  pregnant  mice  were  randomly  assigned 
to one of four dosage groups (0, 0.75, 1.50, 3.00 g/kg) of ethylene glycol. Each
group  of  mice  was  exposed  to  the  ethylene  glycol  concentration  and  then  their  fetuses 
were  examined  for  defects.  Each  fetus  was  classified  as  either  Dead/Resorption, 
Malformed,  or  Normal.  The  continuation-ratio  logit  is  a natural  link  function  for  this 
type  of  response  as  obtaining  a given  classification  is  dependent  on  passing  through 
the  prior  classifications.  A random  effects  approach  is  also  warranted  for  this  type  of 




Table 3.7: Parameter estimates for fitting model (3.87) using the adjacent-category
logit model. Numbers in column labels denote the number of quadrature points
used in the adaptive Gauss-Hermite (AGH) algorithm. REPL refers to the restricted
pseudo-likelihood algorithm and FIXED refers to the fixed effects model obtained by
omitting the random effect. Standard errors are given in parentheses.

                FIXED             AGH(8)            REPL
$\alpha_1$       0.026            -0.009            -0.001
$\alpha_2$      -0.738            -1.043            -0.953
$\alpha_3$      -1.326            -1.880            -1.716
$\alpha_4$      -1.461            -2.250            -2.017
$\beta_{TE}$    -0.845 (.205)     -1.149 (.268)     -1.058 (.234)
$\beta_{CO}$    -0.483 (.162)     -0.659 (.203)     -0.607 (.184)
$\beta_{BO}$    -0.041 (.143)     -0.056 (.167)     -0.051 (.160)
$\sigma$           -               0.839 (0.193)     0.832
LL             -85.603           -80.853               -



data  as  fetuses  within  the  same  litter  are  likely  to  be  correlated  due  to  common  genetic 
factors  passed  on  by  the  dam.  For  instance,  some  dams  may  be  more  susceptible  to 
the  ethylene  glycol  which  would  then  be  inherited  by  the  fetuses. 

Coull  and  Agresti  (2000)  analyzed  this  dataset  using  a multivariate  binomial  logit- 
normal  (BLN)  model.  The  multivariate  BLN  model  is  a generalization  of  the  logit- 
normal  model  that  allows  one  to  model  the  correlation  structure  among  a set  of 
binomial  response  variables.  In  contrast  to  the  logit-normal  model  which  allows  only 
positive  correlations  among  observations  from  the  same  cluster,  the  multivariate  BLN 
model  allows  for  a wide  variety  of  covariance  structures  for  modeling  the  observations 
within  a cluster.  Coull  and  Agresti  (2000)  exploited  the  fact  that  the  multinomial 
mass  function  can  be  factored  into  a product  of  binomial  mass  functions.  Thus  the 
multinomial response for a model using the continuation-ratio logit can be treated
as a set of independent binomial counts. This allowed them to fit a continuation-ratio
logit  random  effects  model  using  their  multivariate  BLN  model.  To  evaluate  the 
intractable  integrals,  Coull  and  Agresti  (2000)  employed  Gauss-Hermite  quadrature 
with  20  quadrature  points.  We  fit  similar  models  from  the  framework  of  a multinomial 
random  effects  model  using  adaptive  Gauss-Hermite  quadrature. 
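The factorization of the multinomial mass function into binomial mass functions can be verified numerically for the three-category case considered here (Dead/Resorption, Malformed, Normal). The sketch below is illustrative only. The counts and probabilities are hypothetical, and the function names are not taken from any of the cited software.

```python
from math import comb


def multinomial_pmf(counts, probs):
    """Multinomial mass function for one litter."""
    remaining = sum(counts)
    coef = 1
    for c in counts:
        coef *= comb(remaining, c)
        remaining -= c
    value = float(coef)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value


def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def continuation_ratio_pmf(counts, probs):
    """Product of two binomials: deaths out of all fetuses, then
    malformations out of the live fetuses."""
    n_dead, n_malformed, n_normal = counts
    p_dead, p_malformed, _ = probs
    litter_size = n_dead + n_malformed + n_normal
    n_live = litter_size - n_dead
    p_malformed_given_live = p_malformed / (1.0 - p_dead)
    return (binomial_pmf(n_dead, litter_size, p_dead)
            * binomial_pmf(n_malformed, n_live, p_malformed_given_live))


# Hypothetical litter: 2 dead, 5 malformed, 8 normal.
counts = (2, 5, 8)
probs = (0.1, 0.3, 0.6)
print(multinomial_pmf(counts, probs), continuation_ratio_pmf(counts, probs))
```

The two binomial success probabilities are exactly the quantities modelled by the two continuation-ratio logits, which is why the multinomial likelihood can be handled with binomial machinery.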

We  begin  by  fitting  a simple  model  that  assumes  a common  dosage  effect  for  the 
two  logits.  The  two  logits  model  the  probability  of  a dead/resorbed  fetus  and  the 
conditional  probability  of  a malformed  fetus  given  the  fetus  was  alive.  To  account  for 
the  possibility  of  litter  effects,  we  include  a litter-specific  random  effect  which  allows 
the logits for each litter to be shifted. The linear predictor for the ith litter has the
form

$$\eta_{ir} = \alpha_r + \beta_{DO}\, x_i + u_i, \quad r = 1, \ldots, R - 1, \quad i = 1, \ldots, n, \tag{3.88}$$




where $R = 3$, $n = 94$, and $\beta_{DO}$ denotes the parameter for the dosage predictor.
Table 3.8 displays the results for this model using the adaptive quadrature, automated
MCEM, and REPL algorithms. The adaptive quadrature algorithm required
12 quadrature points and converged in less than 30 seconds. Starting from the REPL
estimates, the automated Monte Carlo EM algorithm required 43 minutes to converge
with a final simulation sample size of 2,361. The REPL algorithm required less than
30 seconds to converge. Interpreting the adaptive quadrature results, the estimated
odds of death and the estimated odds of malformation given survival are multiplied
by exp(1.303) = 3.68 for every additional g/kg of ethylene glycol. Accounting for
the heterogeneity of the litters increased both the dosage effect size and its standard
error when compared with the fixed effects analysis. The variation among the litters
was fairly large with an estimated standard deviation of 1.18. The Monte Carlo
EM algorithm provided similar estimates to the adaptive quadrature approach. The
approximate REPL algorithm tended to underestimate the parameters. This may
be due to the large number of clusters (94). The estimating equations in the REPL
procedure simultaneously estimate the fixed effect parameters and the random effect
parameters. Such a large number of random effect parameters may adversely affect
the fixed effect parameter estimates.
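The interpretation of the common dosage effect can be illustrated by converting the two continuation-ratio logits into the three outcome probabilities. The sketch below is illustrative only. It assumes the logits are oriented as described in the text (the first for death, the second for malformation given survival), uses the AGH(12) estimates from Table 3.8, and evaluates the probabilities for a litter whose random effect is zero (the default `u = 0.0` is an assumption made for illustration).

```python
import math

ALPHA_1, ALPHA_2, BETA_DO = -7.020, -3.398, 1.303  # AGH(12) column of Table 3.8


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def outcome_probs(dose, u=0.0):
    """(P(dead), P(malformed), P(normal)) for a litter with random effect u."""
    p_dead = expit(ALPHA_1 + BETA_DO * dose + u)
    p_malformed_given_alive = expit(ALPHA_2 + BETA_DO * dose + u)
    p_malformed = (1.0 - p_dead) * p_malformed_given_alive
    return p_dead, p_malformed, 1.0 - p_dead - p_malformed


def odds(p):
    return p / (1.0 - p)


for dose in (0.0, 0.75, 1.50, 3.00):
    print(dose, [round(p, 4) for p in outcome_probs(dose)])

# Odds of death are multiplied by exp(beta) per additional g/kg.
odds_multiplier = odds(outcome_probs(1.0)[0]) / odds(outcome_probs(0.0)[0])
print(round(odds_multiplier, 2))
```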

Model (3.88) makes the strong assumption that the ethylene glycol dosage influences
the probability of death and the conditional probability of malformation given
survival in the same manner. This assumption can be relaxed by allowing a separate
dosage effect parameter for each logit. The shifting of thresholds assumption in
model (3.88) can also be relaxed by allowing the probability of death and the conditional
probability of malformation given survival to vary individually. The variation
in probabilities may be, for example, due to differing susceptibilities of the mice to
the dosages. This is accomplished by replacing $u_i$ in (3.88) with a litter-logit-specific




Table 3.8: Parameter estimates for fitting model (3.88) using the continuation-ratio
logit link. Numbers in column labels denote the number of quadrature points used
in the adaptive Gauss-Hermite (AGH) algorithm. MCEM^ denotes the automated
Monte Carlo EM algorithm, while REPL refers to the restricted pseudo-likelihood
algorithm. The fixed effects results (FIXED) were obtained by fitting model (3.88)
without a random effect. Standard errors are given in parentheses.

                FIXED             AGH(12)           MCEM^             REPL
$\alpha_1$      -5.852            -7.020            -7.008            -6.450
$\alpha_2$      -2.707            -3.398            -3.392            -3.052
$\beta_{DO}$     1.041 (.073)      1.303 (.135)      1.301 (.135)      1.176 (.134)
$\sigma$           -               1.175 (0.153)     1.157             1.058
LL            -539.765          -494.528               -                 -

random effect $u_{ir}$, yielding the model

$$\eta_{ir} = \alpha_r + \beta_{DO_r}\, x_i + u_{ir}. \tag{3.89}$$

This new model has a multivariate random effect $\mathbf{u}_i = (u_{i1}, u_{i2})'$ which we assume to
be multivariate normal with zero mean and covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

In addition to the fixed effects version of model (3.89), Table 3.9 contains the results
for three variations of model (3.89) using the adaptive quadrature algorithm. In
the second column of results we fit model (3.89) allowing only shifted thresholds,
as in model (3.88), with $\sigma_1 = \sigma_2$ and $\rho = 1$. We then allow for separate random
effects for each logit, but assume that the random effects are independent. That
is,

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}.$$

The estimates from this model are found in the third column
of results. For the final column of results we allow the random logit effects to be
correlated.




From  all  four  models  it  is  obvious  that  there  is  a dosage  effect  on  malformation 
given  survival  but  not  on  death.  The  models  allowing  varying  logits  provide  very 
similar  results.  One  could  perform  a likelihood-ratio  test  for  comparing  these  two 
models  with  a null  hypothesis  of  H0  : p = 0.  One  would  conclude  that  the  model 
without  the  correlation  is  sufficient  as  the  likelihood-ratio  test  statistic  is  only  .022. 
Comparisons of the shifted threshold model with the varying threshold models would
require a score test as the null hypothesis model would contain parameters on the
boundary of the parameter space ($\sigma^2 = 0$). We consider such a test in Chapter 5.
From  the  log-likelihood  values  alone,  one  would  conclude  that  the  varying  threshold 
model  without  correlation  is  the  most  appropriate  model.  It  is  evident  from  the 
standard  deviation  estimates  for  this  model  that  the  litter  effect  is  much  stronger 
for  the  malformation  given  survival  outcome  than  the  death  outcome.  Coull  and 
Agresti  (2000)  considered  a number  of  other  models  that  allowed  different  covariance 
structures  for  each  dosage  group.  They  also  concluded  that  the  varying  threshold 
model  without  correlation  described  the  data  well. 
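The likelihood-ratio comparison of the two varying logit models can be reproduced from the log-likelihood values in Table 3.9. The sketch below recomputes the statistic and, as an addition not reported in the text, an approximate p-value from the chi-square distribution with one degree of freedom (the null value of the correlation, zero, is interior to the parameter space).

```python
import math

ll_independent = -464.744  # varying logits, correlation fixed at 0 (Table 3.9)
ll_correlated = -464.733   # varying logits, correlation estimated (Table 3.9)

lr_statistic = 2.0 * (ll_correlated - ll_independent)

# For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_statistic / 2.0))

print(round(lr_statistic, 3), round(p_value, 2))
```

A statistic this small gives no evidence against the model without the correlation, in agreement with the conclusion drawn above.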


3.6.3  1975  General  Household  Survey  Life  Satisfaction  Data 

The final dataset, Table 3.4, comes from the 1975 U.S. General Household Survey.
A total of 1,472 subjects were asked to rate their degree of life satisfaction
with their family, hobbies, and residence using a three-point scale (1 = Low, 2 =
Medium, 3 = High). A number of different approaches have been used to analyze
this particular dataset, such as a 3-class latent variable model (Clogg 1979) and a
latent trait model (Masters 1985), which are summarized in Bartholomew (1987).
Recently, Hedeker (2000) re-analyzed the survey data using a baseline-category logit
random effects model. A nominal logistic regression model expresses the responses in
terms of two logits, namely, $\log \frac{P\{\mathrm{High}\}}{P\{\mathrm{Low}\}}$ and $\log \frac{P\{\mathrm{Med}\}}{P\{\mathrm{Low}\}}$. Hedeker (2000) demonstrated
his modeling approach by fitting a number of models similar to those found in
Bartholomew (1987). As noted before, however, his model does not allow estimation




Table 3.9: Parameter estimates for various fits of the continuation-ratio logit model
(3.89) to the toxicity dataset using the adaptive Gauss-Hermite algorithm. Q. PTS.
denotes the number of quadrature points used. Standard errors are given in parentheses.

                               SHIFTED       VARYING LOGITS
                  FIXED        LOGITS        $\rho = 0$    $\rho \neq 0$
Q. PTS.             -            12            18            18
$\alpha_1$       -4.058        -4.525        -4.196        -4.198
$\alpha_2$       -2.949        -3.911        -4.360        -4.356
$\beta_{DO_1}$    0.094        -0.131         0.083         0.083
                 (.200)        (.205)        (.217)        (.217)
$\beta_{DO_2}$    1.179         1.588         1.781         1.780
                 (.080)        (.160)        (.220)        (.219)
$\sigma_1$          -           1.340         0.559         0.559
$\sigma_2$          -             -           1.586         1.587
$\rho$              -             -           0.000         0.080
LL             -526.908      -473.977      -464.744      -464.733

of correlations between random effects in different thresholds. We analyze Table 3.4
using models similar to those of Hedeker (2000) but allowing for correlated random
effects between thresholds.

Table  3.4  contains  1,472  subjects  or  clusters.  The  exact  maximum  likelihood 
algorithms  in  Section  3.3  are  carried  out  by  numerically  (by  quadrature  or  sampling 
methods)  evaluating  the  integrals  for  each  cluster.  To  directly  fit  Table  3.4  using  this 
approach  would  take  an  exorbitant  amount  of  time,  especially  when  including  multiple 
random effects. One can cleverly “reduce” the number of clusters by noting that there
are only $3^3 = 27$ distinct response patterns possible from the survey. All subjects with
the same response pattern, e.g. the 15 subjects with response 111, will contribute
the same amount to the calculation of the marginal log-likelihood. Thus, within a
given iteration, the numerical approximation of the integrals only needs to be carried
out for the 27 distinct response profiles. The 27 contributions to the marginal
log-likelihood are then multiplied by the appropriate numbers of subjects who responded
in that manner. Such a modification is straightforward in the adaptive Gauss-Hermite
quadrature algorithm as the marginal log-likelihood is directly maximized. Thus, the
results given below were obtained using adaptive quadrature. This also allows us to
compare  the  direct  Gauss-Hermite  maximization  algorithm  of  Hedeker  (2000)  to  our 
direct  adaptive  Gauss-Hermite  algorithm. 
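The reduction from 1,472 clusters to 27 response patterns can be sketched as a frequency-weighted log-likelihood. The code below is a minimal illustration only. It uses the frequencies of Table 3.4, a baseline-category model with a single shared random intercept, ordinary (non-adaptive) Gauss-Hermite quadrature, and made-up parameter values (`BETA`, `SIGMA`), none of which are the estimates or the algorithm used for the results reported below.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Table 3.4: (family, hobbies, residence) response pattern -> frequency
FREQ = {
    (1, 1, 1): 15, (1, 2, 1): 3, (1, 3, 1): 5,
    (1, 1, 2): 11, (1, 2, 2): 12, (1, 3, 2): 14,
    (1, 1, 3): 7, (1, 2, 3): 5, (1, 3, 3): 16,
    (2, 1, 1): 16, (2, 2, 1): 23, (2, 3, 1): 18,
    (2, 1, 2): 26, (2, 2, 2): 58, (2, 3, 2): 38,
    (2, 1, 3): 12, (2, 2, 3): 31, (2, 3, 3): 27,
    (3, 1, 1): 23, (3, 2, 1): 45, (3, 3, 1): 64,
    (3, 1, 2): 49, (3, 2, 2): 117, (3, 3, 2): 191,
    (3, 1, 3): 54, (3, 2, 3): 126, (3, 3, 3): 466,
}

# Hypothetical item parameters: rows = items (F, H, R), columns = logits
# (Medium vs Low, High vs Low). Illustrative values only.
BETA = np.array([[1.0, 2.5], [0.8, 1.5], [1.0, 2.0]])
SIGMA = 1.0


def category_probs(beta_item, u):
    """P(Low), P(Medium), P(High) for one item, with Low as the baseline."""
    eta = np.concatenate(([0.0], beta_item + u))
    e = np.exp(eta - eta.max())
    return e / e.sum()


def pattern_marginal(pattern, beta, sigma, nodes, weights):
    """Marginal probability of one response pattern, integrating out u ~ N(0, sigma^2)."""
    total = 0.0
    for x, w in zip(nodes, weights):
        u = np.sqrt(2.0) * sigma * x
        conditional = 1.0
        for j, response in enumerate(pattern):
            conditional *= category_probs(beta[j], u)[response - 1]
        total += w * conditional
    return total / np.sqrt(np.pi)


def marginal_loglik(freq, beta, sigma, n_nodes=20):
    """One integral per distinct pattern, weighted by the pattern frequency."""
    nodes, weights = hermgauss(n_nodes)
    return sum(n * np.log(pattern_marginal(p, beta, sigma, nodes, weights))
               for p, n in freq.items())


print(len(FREQ), sum(FREQ.values()), marginal_loglik(FREQ, BETA, SIGMA))
```

Each call to `marginal_loglik` evaluates 27 integrals rather than 1,472, which is the saving described above. The same weighting would apply unchanged if the inner integral were computed by adaptive quadrature.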

To analyze Table 3.4 we fit a series of models that are similar to those used in
the continuation-ratio logit example. For the jth question of the ith subject, let
$x_{ij1}$, $x_{ij2}$, and $x_{ij3}$ be indicator variables for the family, hobbies, and residence items,
respectively, such that $x_{ijl} = 1$ if $j = l$, $j, l = 1, \ldots, 3$, and zero otherwise. As in
Hedeker (2000), we consider models that allow separate item parameters for each
logit, which we denote, for the rth logit, by $\beta_{Fr}$, $\beta_{Hr}$, and $\beta_{Rr}$ for the family, hobbies,
and residence items, respectively. We first fit a simple fixed effects model

$$\eta_{ijr} = \beta_{Fr}\, x_{ij1} + \beta_{Hr}\, x_{ij2} + \beta_{Rr}\, x_{ij3}, \quad r = 1, 2, \quad i = 1, \ldots, 1472. \tag{3.90}$$

Note that in order to directly estimate the three item parameters for each logit, we
removed the threshold parameters $\{\alpha_r\}$ found in the original definition of the baseline-
category logit model (see Section 2.4.1). Model (3.90) is unrealistic as responses from
the same subject are certain to be correlated. We then allowed for a shift in thresholds
by including a subject-specific intercept term. This model assumes that the random
subject-effect influences the logits High versus Low and Medium versus Low in the
same manner. Such an assumption is inappropriate for the baseline-category logit
model since the nominal responses need not have any relation to each other. Thus
one should not expect the subject-specific effect to remain the same across all logits.
More realistically, one could allow for varying subject effects for the two logits, as
was done in the continuation-ratio logit model. This model has the form

$$\eta_{ijr} = \beta_{Fr}\, x_{ij1} + \beta_{Hr}\, x_{ij2} + \beta_{Rr}\, x_{ij3} + u_{ir}, \tag{3.91}$$




where $u_{ir}$ is a threshold-specific random effect. Recall that the first logit compares
Medium versus Low while the second compares High versus Low.

The specification of model (3.91) is not complete until we specify the structure of
the covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$

for the random effects vector $\mathbf{u}_i = (u_{i1}, u_{i2})'$.
In the baseline-category logit random effects model proposed by Hedeker (2000), the
covariance term is forced to be $\sqrt{\sigma_1^2 \sigma_2^2}$ so that the correlation $\rho$ is equal to one. This
assumption is a consequence of the estimation algorithm he used. His general model
has the form

$$\eta_{ijr} = \mathbf{z}_{ij}' \boldsymbol{\beta}_r + \mathbf{w}_{ij}' \mathbf{u}_{ir},$$

where $\mathbf{z}_{ij}$ and $\mathbf{w}_{ij}$ are the fixed and random effects design vectors, respectively, $\boldsymbol{\beta}_r$
are the q fixed effects parameter vectors, and $\mathbf{u}_{ir}$ is the cluster-logit-specific random
effects vector for the rth logit. Additionally he assumed that the random effects vector
$\mathbf{u}_{ir}$ followed a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix
$\Sigma_r$. Note that the covariance matrix is allowed to vary at the logit level r. To simplify
the use of multivariate Gauss-Hermite quadrature, Hedeker (2000) standardized the
random effects by letting $\mathbf{u}_{ir} = \Sigma_r^{1/2} \boldsymbol{\theta}_i$, where $\Sigma_r^{1/2}$ is a matrix containing the elements
of the lower Cholesky square root of $\Sigma_r$ and $\boldsymbol{\theta}_i$ is a multivariate standard normal
random variable. The resulting model has the form

$$\eta_{ijr} = \mathbf{z}_{ij}' \boldsymbol{\beta}_r + \mathbf{w}_{ij}' \Sigma_r^{1/2} \boldsymbol{\theta}_i.$$

To estimate the regression parameter vectors $\boldsymbol{\beta}_r$ and the Cholesky elements of $\Sigma_r^{1/2}$,
Hedeker (2000) used a Fisher scoring algorithm on the marginal likelihood obtained
through Gauss-Hermite quadrature.

Since the covariance matrix $\Sigma_r$ is defined at the logit level, the distribution of the
random effects for cluster i has a block diagonal covariance matrix $\Sigma = [\Sigma_r]$ with
zeros above and below the block diagonal. Thus the random effects terms between
logits are assumed to be independent. However, in the Fortran program MIXNO
(Mixed-effects Nominal Logistic Regression) that Hedeker (2000) made available to
implement his methods, this is not how the model is fit. In MIXNO the random effects
vector and, in turn, the covariance matrix are defined at the cluster level as in our
model. Thus, for this example, the MIXNO model allowing a shifting in thresholds
would be


$$\eta_{ijr} = \beta_{Fr}\, x_{ij1} + \beta_{Hr}\, x_{ij2} + \beta_{Rr}\, x_{ij3} + \sigma \theta_i,$$

where $\theta_i$ is a standard normal random variable. To allow for varying thresholds,
however, MIXNO actually fits, in contrast to (3.91), the model

$$\eta_{ijr} = \beta_{Fr}\, x_{ij1} + \beta_{Hr}\, x_{ij2} + \beta_{Rr}\, x_{ij3} + \sigma_1 \theta_i I[r = 1] + \sigma_2 \theta_i I[r = 2],$$

where $I[r = l]$, $l = 1, 2$, is an indicator function which is one if $r = l$ and zero
otherwise. Note that the two random effects, $u_{i1}$ and $u_{i2}$, in (3.91), have been replaced
by a single random effect $\theta_i$. Thus, for our varying threshold model (3.91)
to be equivalent with that fit by MIXNO, we would have to assume that $u_{i1}$ and
$u_{i2}$ are perfectly (positively) correlated. For the ith subject, let $(\xi_{i1}, \xi_{i2})$ be the two
realizations of the random threshold effects for model (3.91). The realization $\xi_{ir}$ is
the amount that the threshold for the ith subject is perturbed from the overall mean
threshold $\alpha_r$, $r = 1, 2$. The assumption of perfect correlation between the threshold
random effects implies that for all subjects the realizations $(\xi_{i1}, \xi_{i2})$, $i = 1, \ldots, n$, lie
on a line in $\mathbb{R}^2$ with positive slope. Thus, the thresholds for a given subject can vary
from the overall mean of the thresholds, but the amounts that they are perturbed
will be linearly related to the perturbations for all other subjects. As the amounts
that the thresholds are perturbed are related to a subject’s perception of utility for
the choices, it seems inappropriate to assume that the perception of utility across the
choices for all subjects will follow such a perfect relationship. Indeed, if the nominal
responses referred to choice of political party (i.e., democrat, republican, independent),
one would expect quite different utility perceptions for subjects with differing
political ideologies.
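The perfect correlation implied by using a single standardized random effect for both logits follows in one step. As a short worked derivation, write the two MIXNO threshold effects as $u_{ir} = \sigma_r \theta_i$ with $\theta_i \sim N(0, 1)$, and assume $\sigma_1, \sigma_2 > 0$:

```latex
\begin{align*}
\operatorname{Var}(u_{ir}) &= \sigma_r^2 \operatorname{Var}(\theta_i) = \sigma_r^2, \qquad r = 1, 2, \\
\operatorname{Cov}(u_{i1}, u_{i2}) &= \sigma_1 \sigma_2 \operatorname{Var}(\theta_i) = \sigma_1 \sigma_2, \\
\operatorname{Corr}(u_{i1}, u_{i2}) &= \frac{\sigma_1 \sigma_2}{\sqrt{\sigma_1^2 \sigma_2^2}} = 1, \\
\Sigma &= \begin{pmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \\ \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}
        = \begin{pmatrix} \sigma_1 \\ \sigma_2 \end{pmatrix}
          \begin{pmatrix} \sigma_1 & \sigma_2 \end{pmatrix}.
\end{align*}
```

The implied covariance matrix has rank one, and every realization satisfies $u_{i2} = (\sigma_2 / \sigma_1)\, u_{i1}$, which is the line with positive slope described above.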

The  same  approach  is  used  by  MIXNO  to  allow  random  covariate  effects  to  vary 
across  logits.  The  reason  that  this  approach  is  used  is  that  it  greatly  reduces  the 
number  of  integrals  that  need  to  be  approximated.  Instead  of,  in  general,  having  a 
set  of  q integrals  for  each  cluster,  there  is  only  one  set  of  integrals  to  approximate. 
The  consequence  of  this  approach,  however,  is  that  random  effects  between  logits  are 
always perfectly correlated. Table 3.10 contains the results of fitting the fixed effects
model (3.90) and various versions of the random effects model (3.91) with our method
and that proposed by Hedeker (2000).

For all models in Table 3.10, including the fixed effects model, the item fixed effects
parameters are highly significant, indicating an increased probability of medium or
high satisfaction responses, relative to low satisfaction, for all items. Allowing for
a shift in thresholds increased both the standard errors and the effect sizes, with the
estimated standard deviation of the random effect being 1.14. This trend held for
the other models as well, except for the varying threshold model with the correlation
fixed at zero. We included this model as it coincides with the original model definition
given by Hedeker (2000). Though the log-likelihood for this model (-3744.66) is higher
than that of the shifted threshold model (-3828.93), the parameter estimates are quite
different from those of the remaining varying threshold models. In fact, they are very
similar to those obtained for the fixed effects model, especially for the first logit comparing
medium versus low. The second-to-last column contains the results for the varying threshold
model with the correlation fixed at 1.0. These results were obtained using MIXNO
and are reported in Hedeker (2000). Comparing these estimates to the final column
where the correlation was estimated, we see that they are in better agreement than
those from the model with $\rho = 0$. Thus allowing for some correlation, albeit a correlation of
1.0, does provide estimates that are closer to those of the most general model. However, we
do see some changes in the parameter estimates when the correlation is
estimated. For one, the estimated standard deviation for the first
logit (0.832) is more than double that obtained when $\rho = 1$ (0.333). Also, for all of the item
parameters, the effect sizes and standard errors are larger.

Table 3.10: Parameter estimates (standard errors in parentheses) for various fits of the
baseline-category logit model (3.91) to the life satisfaction dataset using the adaptive
Gauss-Hermite algorithm (AGH) and the Gauss-Hermite algorithm (GH) of Hedeker (2000).
The numbers in the column labels denote the number of quadrature points used.

                            SHIFTED                  VARYING LOGITS
                 FIXED      LOGITS      $\rho = 0$   $\rho = 1$   $\rho = \hat{\rho}$
                            AGH(13)     AGH(15)      GH(20)       AGH(15)

$\beta_{F1}$     1.040      1.572       1.004        1.327        1.384
                 (.124)     (.161)      (.128)       (.161)       (.169)
$\beta_{H1}$     0.679      1.083       0.658        0.882        0.933
                 (.084)     (.116)      (.074)       (.110)       (.119)
$\beta_{R1}$     0.890      1.295       0.880        1.074        1.144
                 (.082)     (.114)      (.084)       (.104)       (.115)
$\beta_{F2}$     2.557      3.089       2.949        3.166        3.264
                 (.111)     (.151)      (.127)       (.150)       (.161)
$\beta_{H2}$     1.371      1.775       1.477        1.615        1.709
                 (.077)     (.111)      (.091)       (.110)       (.118)
$\beta_{R2}$     1.256      1.661       1.276        1.403        1.509
                 (.078)     (.112)      (.091)       (.109)       (.118)
$\sigma_1$       -          1.142       0.442        0.333        0.832
$\sigma_2$       -          -           1.320        1.582        1.626
$\rho$           -          -           0.000        1.000        0.617
LL               -3854.96   -3828.93    -3744.66     -3742.23     -3736.93

In practice, one would be interested in comparing the estimated item parameters
between the two logits. For example, for the varying threshold model allowing
correlation between thresholds, the odds of a subject having medium versus low
satisfaction with their family are exp(1.384 - 0.933) = 1.6 times the odds of having
medium versus low satisfaction with their hobbies. In contrast, the odds for a given
subject of choosing high versus low satisfaction for family are exp(3.264 - 1.709) = 4.7
times the odds of choosing high versus low satisfaction for hobbies. A similar pattern
occurs when comparing family satisfaction versus residence satisfaction (odds ratios of
exp(1.384 - 1.144) = 1.3 and exp(3.264 - 1.509) = 5.8 for medium versus low and high
versus low, respectively). The comparison of satisfaction with hobbies and satisfaction
with residence yields similar odds ratios for medium versus low (exp(0.933 - 1.144) = 0.8)
and high versus low (exp(1.709 - 1.509) = 1.2). Thus one might consider fitting a
simpler model which constrains the hobbies and residence parameters to be the same
within each logit (i.e., $\beta_{Hr} = \beta_{Rr}$, $r = 1, 2$). In summary, the main effect observed is
a greater tendency for high satisfaction with families than with hobbies or residence.
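
The odds ratios quoted in this paragraph can be reproduced directly from the final
column of Table 3.10. The following is a minimal sketch; the dictionary layout and the
function name `odds_ratio` are ours and are used only for illustration.

```python
import math

# Item estimates from the last column of Table 3.10 (correlation estimated).
# First entry: logit 1 (medium vs. low); second entry: logit 2 (high vs. low).
beta = {
    "family":    (1.384, 3.264),
    "hobbies":   (0.933, 1.709),
    "residence": (1.144, 1.509),
}

def odds_ratio(item_a, item_b, logit):
    """Odds ratio for item_a relative to item_b on the given logit (1 or 2)."""
    return math.exp(beta[item_a][logit - 1] - beta[item_b][logit - 1])

pairs = [("family", "hobbies"), ("family", "residence"), ("hobbies", "residence")]
for a, b in pairs:
    print(f"{a} vs {b}: medium/low {odds_ratio(a, b, 1):.1f}, "
          f"high/low {odds_ratio(a, b, 2):.1f}")
```

The hobbies versus residence ratios are both close to 1, which is what motivates the
constrained model mentioned in the text.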

3.7 Cumulative Logit Models with Random Thresholds

Recall the wine tasting dataset of Table 3.2. To account for possible differences in
sensitivity to wine bitterness among the judges, we analyzed the data in Section 3.6.1
with a cumulative logit model that allowed the thresholds to be randomly shifted
for each judge. An underlying assumption in the shifted threshold model is that all
thresholds in a given cluster are shifted by the same amount. This assumption also
implies that the distance between thresholds remains the same across all clusters.
As an example, the estimated thresholds for the wine dataset were $\hat{\alpha}_1 = -4.082$,
$\hat{\alpha}_2 = -0.930$, $\hat{\alpha}_3 = 1.797$, and $\hat{\alpha}_4 = 3.657$ (Table 3.5, AGH(5) results). These
estimates represent the average threshold estimates across all judges. The judge-specific
thresholds are shifted from the average thresholds by the predicted judge effects given
in Table 3.6. For each judge, however, the distance between thresholds will remain
$\hat{\alpha}_4 - \hat{\alpha}_3 = 1.860$, $\hat{\alpha}_3 - \hat{\alpha}_2 = 2.727$, and $\hat{\alpha}_2 - \hat{\alpha}_1 = 3.152$. Tutz and Hennevogl (1996)
proposed a cumulative logit random effects model which relaxed this condition. For
a given cluster, they assumed that each threshold was allowed to vary according to
its own distribution. Thus the distance between thresholds could vary across judges.
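
To make the constant-spacing property concrete, the sketch below shifts the four average
thresholds by a single judge effect and recomputes the gaps. The shift value is
hypothetical (it is not one of the predicted effects in Table 3.6), and the function names
are ours.

```python
# Average thresholds from the shifted threshold fit (Table 3.5, AGH(5) results).
alpha = [-4.082, -0.930, 1.797, 3.657]

def shifted_thresholds(thresholds, u):
    """Judge-specific thresholds when every threshold is shifted by the same u."""
    return [a + u for a in thresholds]

def gaps(thresholds):
    """Distances between adjacent thresholds, rounded to three decimals."""
    return [round(b - a, 3) for a, b in zip(thresholds, thresholds[1:])]

u_hypothetical = 0.75  # illustrative judge effect, NOT a value from Table 3.6

print(gaps(alpha))
print(gaps(shifted_thresholds(alpha, u_hypothetical)))
```

Both calls print the same three gaps, whatever value is used for the shift.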

We have already considered models in the previous section in which each threshold
was allowed to vary according to its own distribution, using baseline-category and
continuation-ratio logits. Applying this approach to the cumulative logit model is
more difficult, however, due to the required ordering of the thresholds. Recall that
for the cumulative logit model, the thresholds are ordered such that $\alpha_1 < \cdots < \alpha_q$.
For the random threshold model the restriction becomes

\[
\alpha_1 + u_{i1} < \cdots < \alpha_q + u_{iq},
\]

where $u_i = (u_{i1}, \ldots, u_{iq})'$ is assumed to be multivariate normal with mean 0 and
covariance matrix $\Sigma_\alpha$. Unless the thresholds are well separated or the diagonal
elements of $\Sigma_\alpha$ are small, this ordering is likely to be violated during the estimation
routine. Indeed, $P(\alpha_1 + u_{i1} < \cdots < \alpha_q + u_{iq}) < 1$ unless $\mathrm{corr}(u_{ir}, u_{ir'}) = 1$ for
$r \neq r'$ (Tutz and Hennevogl 1996). There are two approaches that can be used to
avoid such violations. First, one could use a constrained maximization routine which
would enforce the restriction while carrying out the maximization. Such algorithms
are difficult to program, especially when using quadrature or Monte Carlo methods.
The second approach is to transform the thresholds to a new set of parameters which
are not restricted. Maximization is then carried out using the transformed thresholds.
Such an approach was utilized by Tutz and Hennevogl (1996). To incorporate the
reparameterized thresholds into the multivariate generalized linear model framework,
new response functions and design matrices must be defined. In the next section we
present the extended random effects model and outline the necessary modifications to
the fitting algorithms. We then illustrate the extended model in Section 3.7.2 using
the wine dataset analyzed previously in Section 3.6.1. In Section 3.7.3 we examine
the extended model in more detail and discuss some of the problems that we encountered
while using it. We conclude in Section 3.7.4 with a small simulation study that
examines the bias in regression parameters when one fits the simpler shifted threshold
model in lieu of the extended model.
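
For two thresholds, the probability that the ordering holds has a closed form, because
$u_{i1} - u_{i2}$ is normal with variance $\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2$. The sketch below evaluates
it. The threshold and standard deviation values are borrowed from the simulation settings
of Section 3.7.4 purely for illustration, and the function names are ours.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_ordered(alpha1, alpha2, sd1, sd2, rho):
    """P(alpha1 + u1 < alpha2 + u2) for bivariate normal (u1, u2) with mean 0."""
    var_diff = sd1**2 + sd2**2 - 2.0 * rho * sd1 * sd2  # Var(u1 - u2)
    if var_diff <= 0.0:
        # u1 - u2 is degenerate at 0, so the fixed ordering is never disturbed.
        return 1.0 if alpha1 < alpha2 else 0.0
    return normal_cdf((alpha2 - alpha1) / math.sqrt(var_diff))

# Well separated thresholds, unequal variances, negative correlation.
p_neg = prob_ordered(-1.25, 1.25, 0.4, 1.0, -0.8)
# Same thresholds and variances, positive correlation.
p_pos = prob_ordered(-1.25, 1.25, 0.4, 1.0, 0.8)
# Equal variances and perfect correlation: a pure shift.
p_shift = prob_ordered(-1.25, 1.25, 1.0, 1.0, 1.0)
print(round(p_neg, 3), round(p_pos, 4), p_shift)
```

Even with thresholds 2.5 units apart, a negative correlation leaves a noticeable chance
of crossing, while the pure shift never violates the ordering.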

3.7.1  Extension  to  Random  Thresholds 

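To avoid violation of the threshold ordering, Tutz and Hennevogl (1996) proposed
the following reparameterization, for which the model with random thresholds is more
appropriate:

\[
\tilde{\alpha}_1 = \alpha_1, \qquad \tilde{\alpha}_r = \log(\alpha_r - \alpha_{r-1}), \quad r = 2, \ldots, q, \tag{3.92}
\]

or equivalently

\[
\alpha_1 = \tilde{\alpha}_1, \qquad \alpha_r = \tilde{\alpha}_1 + \sum_{l=2}^{r} \exp(\tilde{\alpha}_l), \quad r = 2, \ldots, q.
\]

Under this parameterization, the new thresholds $\tilde{\alpha}_1, \ldots, \tilde{\alpha}_q$ are unrestricted with
parameter space $(\tilde{\alpha}_1, \ldots, \tilde{\alpha}_q) \in \mathbb{R}^q$. Note that $\tilde{\alpha}_1$ is equivalent to $\alpha_1$, while the remaining
reparameterized thresholds measure the log of the distance between adjacent original thresholds.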
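
A minimal sketch of the reparameterization (3.92) and its inverse follows. The function
names are ours; the example thresholds are the shifted-model estimates quoted earlier for
the wine dataset and serve only as convenient ordered inputs.

```python
import math

def to_unrestricted(alpha):
    """(3.92): keep the first threshold, then take logs of adjacent differences."""
    return [alpha[0]] + [math.log(b - a) for a, b in zip(alpha, alpha[1:])]

def to_ordered(alpha_tilde):
    """Inverse map: cumulative sums of exp() give strictly increasing thresholds."""
    alpha = [alpha_tilde[0]]
    for t in alpha_tilde[1:]:
        alpha.append(alpha[-1] + math.exp(t))
    return alpha

alpha = [-4.082, -0.930, 1.797, 3.657]
alpha_tilde = to_unrestricted(alpha)
print([round(t, 3) for t in alpha_tilde])
print([round(a, 3) for a in to_ordered(alpha_tilde)])

# Any real vector maps back to a valid (strictly increasing) set of thresholds.
print(to_ordered([0.0, -3.0, 2.0, -10.0]))
```

Because the inverse map only adds positive increments, an optimizer can move freely over
the unrestricted parameters without ever producing crossed thresholds.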

Because  of  the  reparameterized  thresholds,  the  response  functions  and  design 
matrices  given  in  Section  2.4.3  for  the  cumulative  logit  link  are  no  longer  valid.  In 
particular,  a new  response  function  must  be  defined  so  that  the  usual  form  of  the 
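multivariate generalized linear model is obtained. For the random threshold model
the new response function has the form $h(\eta_{ij}) = (h_1(\eta_{ij}), \ldots, h_q(\eta_{ij}))'$, with

\[
h_1(\eta_{ij}) = \frac{1}{1 + \exp(-\eta_{ij1})},
\]
\[
h_r(\eta_{ij}) = \frac{1}{1 + \exp\left(-\eta_{ij1} - \sum_{l=2}^{r} \exp(\eta_{ijl})\right)}
             - \frac{1}{1 + \exp\left(-\eta_{ij1} - \sum_{l=2}^{r-1} \exp(\eta_{ijl})\right)},
\quad r = 2, \ldots, q, \tag{3.93}
\]

where the linear predictor $\eta_{ij} = (\eta_{ij1}, \ldots, \eta_{ijq})'$ is given by

\[
\eta_{ij1} = \tilde{\alpha}_1 + x_{ij}'\beta + u_{i1},
\]
\[
\eta_{ijr} = \tilde{\alpha}_r + u_{ir}, \quad r = 2, \ldots, q. \tag{3.94}
\]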


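Equivalently, the new link function $g(\pi_{ij}) = (g_1(\pi_{ij}), \ldots, g_q(\pi_{ij}))'$ is given by

\[
g_1(\pi_{ij}) = \log\left(\frac{\pi_{ij1}}{1 - \pi_{ij1}}\right),
\]
\[
g_r(\pi_{ij}) = \log\left[
  \log\left(\frac{\sum_{l=1}^{r} \pi_{ijl}}{1 - \sum_{l=1}^{r} \pi_{ijl}}\right)
  - \log\left(\frac{\sum_{l=1}^{r-1} \pi_{ijl}}{1 - \sum_{l=1}^{r-1} \pi_{ijl}}\right)
\right], \quad r = 2, \ldots, q. \tag{3.95}
\]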
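
The response function (3.93) and the link function (3.95) are inverses of one another,
which can be checked numerically. The following is a sketch under our reading of those two
equations (cumulative logits built by adding $\exp(\eta_{ijr})$ for $r \geq 2$); the function
names and the test vector are ours.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def response(eta):
    """h(eta): category probabilities pi_1..pi_q from the linear predictor."""
    probs, prev_cum, logit_cum = [], 0.0, eta[0]
    for r, e in enumerate(eta):
        if r > 0:
            logit_cum += math.exp(e)   # cumulative logit grows by exp(eta_r) > 0
        cum = expit(logit_cum)
        probs.append(cum - prev_cum)   # pi_r = difference of cumulative probabilities
        prev_cum = cum
    return probs

def link(pi):
    """g(pi): recover the linear predictor from the category probabilities."""
    eta, cum, prev_logit = [], 0.0, None
    for r, p in enumerate(pi):
        cum += p
        logit_cum = math.log(cum / (1.0 - cum))
        eta.append(logit_cum if r == 0 else math.log(logit_cum - prev_logit))
        prev_logit = logit_cum
    return eta

eta = [-1.0, 0.3, -0.5, 0.1]          # arbitrary unrestricted linear predictor
pi = response(eta)
print([round(p, 4) for p in pi])
print([round(e, 6) for e in link(pi)])
```

Every real-valued `eta` yields positive probabilities for the first $q$ categories, with
the remainder left for category $q + 1$.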

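To accommodate the new link function, the design matrix now takes the form

\[
Z_{ij} = \begin{pmatrix}
1 & 0 & \cdots & 0 & x_{ij}' \\
0 & 1 & \cdots & 0 & 0' \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0'
\end{pmatrix}. \tag{3.96}
\]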


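Using (3.95) and (3.96), the random threshold model has the form $g(\pi_{ij}) = Z_{ij}\beta + u_i$,
where $\beta = (\tilde{\alpha}_1, \ldots, \tilde{\alpha}_q, \gamma')'$, $u_i = (u_{i1}, \ldots, u_{iq})'$, and $\gamma$ is the fixed effects parameter
vector (the covariate coefficients in (3.94)). For the more general model that allows both
threshold random effects and cluster-specific random effects with design vector $w_{ij}$, the
model takes the usual form $g(\pi_{ij}) = Z_{ij}\beta + W_{ij}u_i$, where the random effects design
matrix has the form

\[
W_{ij} = \begin{pmatrix}
1 & 0 & \cdots & 0 & w_{ij}' \\
0 & 1 & \cdots & 0 & 0' \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0'
\end{pmatrix}
\]

and $u_i \sim MVN(0, \Sigma)$.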

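The algorithms given in Sections 3.3 and 3.5 can be used to fit the random thresholds
model by utilizing the new response function (3.93) and the design matrix
(3.96). In addition, the first and second derivatives of the response function with
respect to the linear predictor must also be updated. Let

\[
\Gamma_{ij1} = \frac{1}{1 + \exp(-\eta_{ij1})}
\quad \text{and} \quad
\Gamma_{ijr} = \frac{1}{1 + \exp\left(-\eta_{ij1} - \sum_{l=2}^{r} \exp(\eta_{ijl})\right)},
\quad r = 2, \ldots, q.
\]

Also let $c_{ijr} = \exp(\eta_{ijr})$, $r = 2, \ldots, q$, with $c_{ij1} = 1$. The first derivative matrix
$D_{ij} = \partial h(\eta_{ij}) / \partial \eta_{ij}$ has $(u, v)$th element

\[
d_{uv} = \begin{cases}
0 & \text{if } u > v, \\
\Gamma_{ijv}(1 - \Gamma_{ijv})\, c_{ijv} & \text{if } u = v, \\
\Gamma_{ijv}(1 - \Gamma_{ijv})\, c_{iju} - \Gamma_{ij,v-1}(1 - \Gamma_{ij,v-1})\, c_{iju} & \text{if } u < v.
\end{cases}
\]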

To  calculate  the  observed  information  matrix,  the  second  derivative  matrix  of  the 
response  function  with  respect  to  the  linear  predictor  is  required.  Formulas  for  the 
second  derivative  matrix  in  the  varying  threshold  model  are  very  complicated.  We 
programmed  such  formulas  for  the  application  considered  in  this  section  but  do  not 
provide  the  details.  As  an  alternative,  one  can  use  numerical  derivatives  to  calculate 
the  observed  information  matrix. 
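
As a check on the first derivative formulas, and as an illustration of the numerical
alternative mentioned above, the sketch below compares the analytic matrix with central
finite differences. It assumes that the $(u, v)$ element is $\partial h_v / \partial \eta_u$, which is
our reading of the display; the function names, the step size, and the test vector are ours.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def cumulative(eta):
    """Gamma_1..Gamma_q: cumulative probabilities implied by eta."""
    out, logit_cum = [], eta[0]
    for r, e in enumerate(eta):
        if r > 0:
            logit_cum += math.exp(e)
        out.append(expit(logit_cum))
    return out

def response(eta):
    """h(eta): differences of adjacent cumulative probabilities."""
    g = cumulative(eta)
    return [g[0]] + [g[r] - g[r - 1] for r in range(1, len(g))]

def derivative_matrix(eta):
    """Analytic D with (u, v) element d h_v / d eta_u; zero when u > v."""
    q = len(eta)
    g = cumulative(eta)
    c = [1.0] + [math.exp(e) for e in eta[1:]]   # c_1 = 1, c_r = exp(eta_r)
    w = [gr * (1.0 - gr) for gr in g]            # Gamma_r (1 - Gamma_r)
    D = [[0.0] * q for _ in range(q)]
    for u in range(q):
        for v in range(u, q):
            D[u][v] = w[v] * c[u] - (w[v - 1] * c[u] if v > u else 0.0)
    return D

def numeric_derivative_matrix(eta, step=1e-6):
    """Central finite-difference approximation to the same matrix."""
    q = len(eta)
    D = [[0.0] * q for _ in range(q)]
    for u in range(q):
        up, dn = list(eta), list(eta)
        up[u] += step
        dn[u] -= step
        h_up, h_dn = response(up), response(dn)
        for v in range(q):
            D[u][v] = (h_up[v] - h_dn[v]) / (2.0 * step)
    return D

eta = [-1.0, 0.3, -0.5, 0.1]
A, N = derivative_matrix(eta), numeric_derivative_matrix(eta)
max_diff = max(abs(A[u][v] - N[u][v]) for u in range(4) for v in range(4))
print(max_diff)
```

The same finite-difference idea, applied to the score function, is one way to obtain a
numerical observed information matrix when the analytic second derivatives are unwieldy.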




3.7.2  Application:  Wine  Tasting  Dataset 

To illustrate the cumulative logit model with varying thresholds, Tutz and Hennevogl
(1996) analyzed the wine tasting dataset (Table 3.2) using their Monte Carlo
EM algorithm. Recall that judges were asked to rate the bitterness of the wine on a
five-point scale. In the varying threshold model, each judge is allowed to have differing
thresholds. The linear predictor for the $j$th rating from the $i$th judge has the
form

\[
\eta_{ij1} = \tilde{\alpha}_1 + \beta_{TE}\, x_{ij1} + \beta_{CO}\, x_{ij2} + \beta_{BO}\, x_{ij3} + u_{i1}, \tag{3.97}
\]
\[
\eta_{ijr} = \tilde{\alpha}_r + u_{ir}, \quad r = 2, \ldots, 4,
\]

where the $\{\tilde{\alpha}_r\}$ are defined as in (3.92) and $u_i = (u_{i1}, \ldots, u_{i4})'$ is assumed to be
multivariate normal with mean 0 and covariance matrix $\Sigma$. The regression parameters
and covariates in (3.97) are defined as in model (3.87) in Section 3.6.1. The dimension
of the random effects vector $u_i$ is four; thus the unstructured covariance matrix $\Sigma$
contains ten parameters: four variance terms and six covariance terms. One might
consider reducing the number of parameters in $\Sigma$ by, for example, assuming common
correlations between the random effects. However, such assumptions are typically
not feasible for this extended model, as we discuss in more detail in the next
section.

Table 3.11 contains the regression parameter and variance component estimates
for model (3.97) as reported by Tutz and Hennevogl (1996), and those obtained using
our adaptive Gauss-Hermite quadrature algorithm. We have also included the results
from the shifted threshold model fit in Section 3.6.1. We note, however, that the
estimates for the second, third, and fourth thresholds of this model are not
comparable with the corresponding thresholds from the extended models. Due to the
small estimated variance components and large correlations, we modified our adaptive
algorithm so that the Cholesky square root of $\Sigma$ was estimated. To fit model (3.97)
we initially used a small number of quadrature points (5) to obtain starting values,
and then refit the model using 15 quadrature points in each dimension. Thus, for each
of the nine judges at each iteration, $15^4 = 50{,}625$ quadrature points were evaluated.
The algorithm required approximately 58 hours to converge. We then ensured that
parameter estimates had converged to four decimal places by refitting the model using
16 quadrature points in each dimension, starting from the final parameter estimates
of the previous fit.
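
The counts behind this computational burden are simple arithmetic, sketched below. The
function names are ours; the product-grid count assumes the same number of quadrature
points in each of the four random effect dimensions, as described above.

```python
def n_covariance_parameters(q):
    """Free parameters in an unstructured q x q covariance matrix."""
    return q * (q + 1) // 2

def n_grid_points(points_per_dim, q):
    """Size of a product quadrature grid over q dimensions."""
    return points_per_dim ** q

q, judges = 4, 9
print(n_covariance_parameters(q))          # variances plus covariances
print(n_grid_points(5, q))                 # starting-value fit
print(n_grid_points(15, q))                # main fit, per judge
print(judges * n_grid_points(15, q))       # per iteration, over all judges
print(n_grid_points(16, q))                # convergence check, per judge
```

The grid grows as the fourth power of the number of points, which is why the refit with
16 points was started from the converged 15-point estimates.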

We begin by examining the results reported by Tutz and Hennevogl (1996) using
the EM algorithm with 10, 20, and 30 Monte Carlo samples. First note that the
estimates are quite variable across the three simulation sizes, and do not appear to
be settling down. Even the log-likelihood values differ considerably between the results
with 10 samples (-78.100) and the results with 30 samples (-80.437). We again see
extremely large estimates for the standard errors, especially with the simulation size of
30. We assume that the large estimates are a result of the error in their programming
that was noted before. The large variation in the parameter estimates is most likely
a result of using Monte Carlo sample sizes that are too small. In addition, Tutz
and Hennevogl (1996) do not account for the Monte Carlo error in the numerical
integration, which would propagate through to the parameter estimates. Tutz and
Hennevogl (1996) concluded, by comparison of log-likelihoods, that the inclusion of
threshold-specific random effects for this dataset was unnecessary.

Comparing  the  results  of  Tutz  and  Hennevogl  (1996)  to  the  adaptive  algorithm  we 
see  a number  of  differences.  The  most  dramatic  difference  is  found  in  the  estimate  of 
the  standard  deviation  for  the  first  threshold  where  the  adaptive  algorithm  obtained 
1.843  while  the  Monte  Carlo  EM  algorithm  with  30  samples  obtained  2.942.  Since 
the  first  threshold  is  the  same  under  both  the  varying  threshold  parameterization  and 
the  shifted  threshold  model,  the  estimated  standard  deviation  for  the  first  threshold 
of the former model is comparable with that of the shifted threshold model. Indeed, Tutz and
Hennevogl (1996) commented that the varying threshold model provided a distinctly
larger estimate for this threshold standard deviation (2.942 versus 1.145). However,
we see with the adaptive algorithm that the estimates are not as markedly different
(1.843 versus 1.145). We do see slightly larger parameter estimates and standard
errors for the adaptive algorithm when compared with the shifted model, but in
general the two models provide very similar results. Even without a formal test for
comparing the two models, the difference in the log-likelihoods (-80.898 versus -81.394)
would certainly suggest that the more complicated model with nine additional parameters
is unneeded.

Table 3.11: Parameter and variance component estimates, and log-likelihood values
(LL) for fitting model (3.97) to the wine tasting dataset using the cumulative logit
link. Standard errors are given in parentheses. Numbers in column labels denote the
number of quadrature points or Monte Carlo samples used in the adaptive Gauss-Hermite
(AGH) and Monte Carlo EM (MCEM) (Tutz and Hennevogl 1996) algorithms. SHIFTED
denotes the results obtained using the AGH algorithm with five quadrature points
allowing only a shifting of thresholds.

                 SHIFTED    AGH(15)    MCEM(10)   MCEM(20)   MCEM(30)

$\alpha_1$       -4.082     -4.661     -6.817     -4.230     -4.916
$\alpha_2$       -0.930      1.287      1.632      1.313      1.410
$\alpha_3$        1.797      0.995      0.951      0.860      0.945
$\alpha_4$        3.657      0.560      0.717      0.529      0.563
$\beta_{TE}$      1.536      1.554      1.529      1.546      1.529
                 (.298)     (.302)     (.628)     (.938)     (1.144)
$\beta_{CO}$      0.916      0.922      0.943      0.976      0.971
                 (.256)     (.247)     (.464)     (.744)     (.875)
$\beta_{BO}$      0.122      0.123      0.093      0.093      0.075
                 (.232)     (.234)     (.424)     (.644)     (.764)
$\sigma_1$        1.145      1.843      3.206      2.775      2.942
$\sigma_2$        -          0.223      0.412      0.324      0.353
$\sigma_3$        -          0.225      0.400      0.344      0.425
$\sigma_4$        -          0.178      0.141      0.296      0.154
LL              -81.394    -80.898    -78.100    -78.895    -80.437




We mentioned before that we needed to estimate the Cholesky factor of $\Sigma$ instead
of $\Sigma$ for this dataset. Due to the parameterization of the thresholds given in (3.92),
the correlations between thresholds can be extremely high. For the given dataset, the
estimated correlation matrix is given by

\[
\begin{pmatrix}
1 & -.951 & -.135 & -.588 \\
  & 1     & .438  & .690  \\
  &       & 1     & -.724 \\
  &       &       & 1
\end{pmatrix}.
\]

Such large correlations coupled with small variance component estimates can cause
problems for the estimation algorithm. Use of the Cholesky form of the covariance
matrix can help alleviate some of these problems.
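
One reason the Cholesky form helps is that any real lower-triangular factor $L$ yields a
valid covariance matrix $\Sigma = LL'$, so the optimizer cannot step outside the admissible
region even when a correlation is close to $\pm 1$. The sketch below illustrates this with
a hypothetical $2 \times 2$ factor (the entries are not estimates from the text); the function
names are ours.

```python
import math

def covariance_from_cholesky(L):
    """Sigma = L L' for a lower-triangular L given as a list of rows."""
    q = len(L)
    return [[sum(L[r][k] * L[s][k] for k in range(min(r, s) + 1))
             for s in range(q)] for r in range(q)]

def sd_and_corr(sigma):
    """Standard deviations and correlation matrix implied by sigma."""
    q = len(sigma)
    sd = [math.sqrt(sigma[r][r]) for r in range(q)]
    corr = [[sigma[r][s] / (sd[r] * sd[s]) for s in range(q)] for r in range(q)]
    return sd, corr

# Hypothetical factor: a large first SD, a small second SD, strong negative correlation.
L = [[1.8, 0.0],
     [-0.2, 0.05]]
sigma = covariance_from_cholesky(L)
sd, corr = sd_and_corr(sigma)
print([round(s, 3) for s in sd])
print(round(corr[0][1], 3))
```

Small changes to the three free entries of `L` always give a correlation strictly inside
(-1, 1), whereas updating the variances and correlation directly can overshoot.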

3.7.3  Discussion 

As  in  the  varying  threshold  models  using  the  baseline-category  and  continuation- 
ratio  logit  links,  the  extended  threshold  model  of  Tutz  and  Hennevogl  (1996)  provides 
a way  of  relaxing  the  shifted  threshold  assumption  for  the  cumulative  logit  link. 
Indeed  for  situations  like  the  wine  tasting  experiment  where  subjects  are  asked  to 
give  their  preferences  for  items,  one  might  expect  the  thresholds  to  vary  individually 
across  subjects.  In  contrast,  however,  to  the  varying  threshold  models  for  the  baseline- 
category  and  continuation-ratio  logit  links,  the  extended  threshold  model  of  Tutz  and 
Hennevogl  (1996)  does  not  have  the  same  interpretation  for  the  shifting  of  thresholds. 

Consider a model with three responses and, thus, two thresholds $(\alpha_1, \alpha_2)$. For the
previous varying threshold models, the thresholds were allowed to vary by introducing
random effects linearly with the thresholds: $(\alpha_1 + u_{i1}, \alpha_2 + u_{i2})$, where $u_i' = (u_{i1}, u_{i2})$
was assumed to be multivariate normal with mean 0 and covariance $\Sigma$. Due to the
ordering of thresholds in the cumulative logit model, random effects were introduced
linearly with the reparameterized thresholds: $(\tilde{\alpha}_1 + u_{i1}, \tilde{\alpha}_2 + u_{i2})$, with the same
assumptions on the random effects. Since $\tilde{\alpha}_1 = \alpha_1$, the shifted first threshold in the
extended model is the same as that for the previous varying threshold models. For
the second threshold, however, we have

\begin{align*}
\tilde{\alpha}_2 + u_{i2} &= \log(\alpha_2 - \tilde{\alpha}_1) + u_{i2} \\
                          &= \log(\alpha_2 - \tilde{\alpha}_1) + \log(\exp(u_{i2})) \\
                          &= \log(\alpha_2 \exp(u_{i2}) - \tilde{\alpha}_1 \exp(u_{i2})).
\end{align*}

Thus we lose the simple interpretation of shifting for the second threshold. We
also see that $\tilde{\alpha}_1$ and $\tilde{\alpha}_2$ may be highly correlated, as $\tilde{\alpha}_2$ contains $\tilde{\alpha}_1$. This was already
seen for the wine dataset, where the first two thresholds had an estimated correlation
of -0.951.
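
Put differently, the random effect on the second reparameterized threshold rescales the
gap between the two thresholds rather than shifting the second threshold. The sketch below
shows this for a single subject. The threshold values and random effects are hypothetical,
and the function name is ours.

```python
import math

def subject_thresholds(alpha1, alpha2, u1, u2):
    """Subject-specific thresholds when random effects enter the reparameterized scale."""
    alpha_tilde2 = math.log(alpha2 - alpha1)
    first = alpha1 + u1                            # additive shift of the first threshold
    second = first + math.exp(alpha_tilde2 + u2)   # gap (alpha2 - alpha1) scaled by exp(u2)
    return first, second

alpha1, alpha2 = -1.25, 1.25      # hypothetical population thresholds
first, second = subject_thresholds(alpha1, alpha2, u1=0.3, u2=-0.4)
print(round(first, 4), round(second, 4))
print(round(second - first, 4), round((alpha2 - alpha1) * math.exp(-0.4), 4))
```

The subject's second threshold therefore depends on both random effects, which is one way
to see why the two components can be strongly correlated in the fitted model.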

As was noted for the wine tasting dataset, and from our experience with other
datasets, fitting the varying threshold model can be difficult. For one, it is often difficult
to find starting values for the covariance matrix $\Sigma$, especially when the number
of thresholds is large. The reason is that the first variance component measures the
variability in the first threshold, while the remaining components measure the variability
in the log difference between adjacent thresholds. Thus, the variance estimates are often
quite different in magnitude. For the wine tasting dataset, the standard deviation
for the first threshold was over 10 times that of the last threshold. Such disparity
in the estimates can also cause numerical problems in the estimating routine. For
the models in the previous sections, covariance (or correlation) terms were always set
at zero as starting values. This typically cannot be done for the varying threshold
model, however, as the random effects are often highly correlated. Indeed, for the
wine tasting dataset we were unable to even evaluate the likelihood until we had chosen
appropriate covariance terms. To reduce the number of parameters and possibly
ease the estimation problems, one could assume a structure for the correlations of the
random thresholds. Unfortunately, the correlations often vary both in magnitude and
sign across all pairs of thresholds. Thus forcing a set structure on the correlations can
lead to more numerical problems. For example, we tried an autoregressive structure
for the wine dataset with heterogeneous variances for the thresholds and correlations
that declined exponentially with distance. That is,

\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho       & \rho^2     & \rho^3 \\
\rho       & \sigma_2^2 & \rho       & \rho^2 \\
\rho^2     & \rho       & \sigma_3^2 & \rho   \\
\rho^3     & \rho^2     & \rho       & \sigma_4^2
\end{pmatrix}.
\]

However, we were unable to even find suitable starting values to start the estimation
routine.
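
To see why such a structure fits poorly here, the sketch below builds a heterogeneous
autoregressive covariance matrix. It assumes the common form in which the $(r, s)$
covariance is $\sigma_r \sigma_s \rho^{|r-s|}$, which is our assumption about how the structure
would be implemented. The standard deviations are the AGH(15) estimates from Table 3.11;
the value of $\rho$ is hypothetical, and the function name is ours.

```python
def arh1_covariance(sds, rho):
    """Heterogeneous AR(1): Cov(r, s) = sd_r * sd_s * rho ** |r - s|."""
    q = len(sds)
    return [[sds[r] * sds[s] * rho ** abs(r - s) for s in range(q)]
            for r in range(q)]

sds = [1.843, 0.223, 0.225, 0.178]   # AGH(15) standard deviations, Table 3.11
rho = -0.5                           # hypothetical correlation parameter
sigma = arh1_covariance(sds, rho)

# Implied correlations depend only on the lag: rho, rho^2, rho^3.
implied = [[sigma[r][s] / (sds[r] * sds[s]) for s in range(4)] for r in range(4)]
for row in implied:
    print([round(x, 3) for x in row])
```

Under this structure every lag-two correlation equals $\rho^2$ and so cannot be negative,
and all lag-one correlations share a single value. The estimated correlation matrix
reported for the wine dataset has neither property, consistent with the difficulties
described above.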

3.7.4  Simulation  Study 

Due to the difficulties in fitting the varying threshold model, the unexpected
results when the response categories are reversed, and the lack of software for
fitting the model, one would most likely, in practice, fit the simpler shifted threshold
model. The shifted threshold model is attractive as it is relatively simple to fit
and has a much simpler interpretation in terms of the original thresholds. What
is not known, however, is how well the shifted threshold model performs when the
varying threshold model truly holds. To examine its performance one should, ideally,
simulate data with varying thresholds, fit the data using both the varying threshold
model and the shifted threshold model, and then compare the results. Unfortunately,
as discussed in the previous section, the varying threshold model can be very unstable
and we were unable to consistently fit the model to the simulated data. Thus, we
carried out a simulation study using only the shifted threshold model.

The goal of the simulation study was to determine the bias in the regression
parameter estimates when data simulated with varying thresholds were fit using the
shifted threshold model. To this end we simulated data from the following model

\[
\eta_{ijr} = \alpha_r + x_{ij}\beta + u_{ir}, \tag{3.98}
\]
\[
i = 1, \ldots, n, \quad r = 1, \ldots, q = R - 1, \quad j = 1, \ldots, T,
\]

where $R = 3$, $T = 7$, $n = 100$, and $u_i = (u_{i1}, u_{i2})'$ is multivariate normal with mean
0 and covariance $\Sigma$. The covariate values $\{x_{ij}\}$ were simulated from the standard
normal distribution, and the parameters in (3.98) were given the values $\alpha_1 = -1.25$,
$\alpha_2 = 1.25$, and $\beta = 0.5$. We chose thresholds with a large spread between them
to reduce the chance of them overlapping when perturbed by a random effect. For
the simulations we varied the structure for the covariance matrix $\Sigma$ of the threshold
random effects $u_i$. That is, we varied $(\sigma_1^2, \rho, \sigma_2^2)$, where $\sigma_r^2$ denotes the variance
component for the $r$th threshold, $r = 1, 2$, and $\rho$ denotes the correlation between the
thresholds. For the variance components we considered two situations: an extreme
situation in which the variance components were quite different and a situation where
the thresholds had similar variabilities. We then varied the correlation between the
random effects, allowing both positive and negative correlations. The combinations
of the factor levels given in the table below provide the settings for the 15 simulations.
It is also informative to consider the ideal situation in which the data truly
come from the shifted threshold model. Thus we performed three simulations where
both thresholds were shifted by the same random effect with variances 0.16, 0.5, and
1.0. From a pilot study it was determined that simulation sizes of 300 would achieve
Monte Carlo error estimates for the regression parameter of less than 0.01.


    $\rho$      $(\sigma_1^2, \sigma_2^2)$

    -0.8        (0.16, 1.0)
    -0.5        (1.0, 0.16)
     0          (0.5, 0.7)
     0.5
     0.8
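
A single dataset from the design above could be generated along the following lines.
This is a sketch under stated assumptions, not the code used for the study: it takes
$\eta_{ijr}$ to be the logit of $P(Y_{ij} \leq r)$, and it redraws a subject's random effects
whenever the perturbed thresholds cross, a device of ours since the text does not say how
crossings were handled. All function names and the seed are ours.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def draw_threshold_effects(rng, sd1, sd2, rho):
    """Bivariate normal (u1, u2) with SDs sd1, sd2 and correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    u1 = sd1 * z1
    u2 = sd2 * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
    return u1, u2

def simulate(n=100, T=7, alpha=(-1.25, 1.25), beta=0.5,
             sd=(0.4, 1.0), rho=-0.8, seed=1):
    """Return a list of (subject, occasion, x, y) with y in {1, 2, 3}."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        # Assumption: redraw until this subject's thresholds are ordered.
        while True:
            u1, u2 = draw_threshold_effects(rng, sd[0], sd[1], rho)
            if alpha[0] + u1 < alpha[1] + u2:
                break
        for j in range(T):
            x = rng.gauss(0.0, 1.0)
            cum1 = expit(alpha[0] + u1 + beta * x)   # P(Y <= 1)
            cum2 = expit(alpha[1] + u2 + beta * x)   # P(Y <= 2)
            v = rng.random()
            y = 1 if v < cum1 else (2 if v < cum2 else 3)
            data.append((i, j, x, y))
    return data

data = simulate()
counts = {k: sum(1 for row in data if row[3] == k) for k in (1, 2, 3)}
print(len(data), counts)
```

Here `sd=(0.4, 1.0)` corresponds to the variance setting (0.16, 1.0); the other rows of
the design are obtained by changing `sd` and `rho`.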

We obtained estimates for the shifted threshold models using the adaptive Gauss-Hermite
algorithm with 15 quadrature points. The initial starting value for the
variance component of the random intercept was 0.5, while the estimates obtained
from the fixed effects version of model (3.98) were used for the remaining parameters.
For each simulation we recorded the estimated bias in the parameter estimates (where
the bias of $\hat{\theta}$ is defined as $E(\hat{\theta}) - \theta$), the average standard error of $\hat{\beta}$ calculated from
the observed information matrix, and the Monte Carlo standard error of $\hat{\beta}$ obtained
over all the simulations. We also estimated the standard deviation of the random
effect for the shifted threshold.

Tables 3.12 and 3.13 contain the results of the 15 simulations, while Table 3.14
contains the results of the three ideal runs. Table 3.12 contains the results for the
extreme situation where the variances (expressed as standard deviations) were very
different, broken down by the value of the correlation $\rho$. Since the variances were
quite different, we ran simulations with the large variance component associated with
the first threshold and also with the second threshold. As one would expect, there was
considerable bias in the estimated threshold that coincided with the larger variance
component, regardless of the correlation. We do see, however, that the largest biases
for the thresholds occurred when the correlation was negative. For example, when
the correlation was -0.8, the second threshold with $(\sigma_1, \rho, \sigma_2) = (0.4, -0.8, 1.0)$ had an
estimated bias of -0.165, or a percent bias of Bias / True Value = -0.165 / 1.25 = -13.2%. The
threshold corresponding to the smaller variance component had smaller estimated
bias, ranging from -0.090 to 0.075.

The shifted threshold model provided biased estimates for the regression parameter
$\beta$ as well. The largest estimated bias occurred when $(\sigma_1, \rho, \sigma_2) = (1.0, -0.8, 0.4)$,
with a value of -0.029, or a percent bias of -5.8%. In fact there is a clear pattern seen
in Table 3.12 that larger estimated biases occurred when the correlation was zero or
negative. For positive correlations, the largest bias was -0.014 (percent bias of -2.8%).
We also see that, for this particular model, larger biases generally occurred in $\beta$ when
the larger variance component was associated with the first threshold. An exception
occurred for the $\rho = 0$ case. These differences, however, could be due to Monte Carlo
error, which was less than 0.01.
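
The percent biases quoted above follow from dividing the estimated bias by the true
parameter value used in the simulation (1.25 for the second threshold, 0.5 for $\beta$).
A minimal sketch, with a function name of our choosing:

```python
def percent_bias(bias, true_value):
    """Estimated bias expressed as a percentage of the true parameter value."""
    return 100.0 * bias / true_value

# Second threshold, (sigma_1, rho, sigma_2) = (0.4, -0.8, 1.0); true value 1.25.
print(round(percent_bias(-0.165, 1.25), 1))
# Regression parameter, largest bias overall; true value 0.5.
print(round(percent_bias(-0.029, 0.5), 1))
# Regression parameter, largest bias among positive correlations.
print(round(percent_bias(-0.014, 0.5), 1))
```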

Table 3.12 also contains the average estimate of the standard deviation for the
shifted threshold, $\hat{\sigma}_A$. This estimate is not very meaningful as the true thresholds
vary individually. However, it is interesting to note that the smallest estimates of $\hat{\sigma}_A$
were found in the $\rho = -0.8$ case and were about 0.28, while the remaining simulations
produced averages between 0.45 and 0.7. We also included the estimates of the
standard error for $\hat{\beta}$ from the observed information matrix and the Monte Carlo
estimate. Neither is very informative, though. Ideally we should compare the observed
information estimate to that obtained in the varying threshold model to see if they
agree. But due to the instability of the varying threshold model, we were not able to
accomplish this.

We now examine the second table, Table 3.13, which contains the results using
similar variance components. Since the variance components were
so similar, we only ran the simulations for the given ordering of the variance components.
We see similar patterns for these results as seen in Table 3.12. Namely, larger
estimated biases occurred in the threshold parameters and the regression coefficient $\beta$
when the correlation was zero or negative. Indeed, when the correlation was positive


Table 3.12: Estimated bias of parameter estimates for the shifted threshold model
using data simulated from the varying threshold model with extreme variation differences
in the thresholds. SE$(\hat{\beta})_O$ and SE$(\hat{\beta})_{MC}$ denote the standard error of $\hat{\beta}$
computed from the observed information matrix and from the Monte Carlo estimates
across all simulations. The estimate $\hat{\sigma}_A$ is the average standard deviation estimate of
the shifted thresholds.

                          THRESHOLD COVARIANCE STRUCTURE $(\sigma_1, \rho, \sigma_2)$
                          (.4, $\rho$, 1)   (1, $\rho$, .4)   (.4, $\rho$, 1)   (1, $\rho$, .4)

                          $\rho = 0$
$\alpha_1$                -0.050       0.132
$\alpha_2$                -0.142       0.017
$\beta$                   -0.023      -0.018
(SE$(\hat{\beta})_O$)     (.075)      (.077)
(SE$(\hat{\beta})_{MC}$)  (.073)      (.072)
$\hat{\sigma}_A$           0.559       0.554

                          $\rho = .5$               $\rho = -.5$
$\alpha_1$                -0.068       0.116      -0.039       0.144
$\alpha_2$                -0.123       0.049      -0.155       0.008
$\beta$                   -0.007      -0.011      -0.021      -0.023
(SE$(\hat{\beta})_O$)     (.076)      (.079)      (.075)      (.077)
(SE$(\hat{\beta})_{MC}$)  (.079)      (.076)      (.078)      (.075)
$\hat{\sigma}_A$           0.633       0.620       0.484       0.633

                          $\rho = .8$               $\rho = -.8$
$\alpha_1$                -0.090       0.108      -0.024       0.152
$\alpha_2$                -0.117       0.075      -0.165      -0.001
$\beta$                    0.004      -0.014      -0.020      -0.029
(SE$(\hat{\beta})_O$)     (.076)      (.079)      (.075)      (.075)
(SE$(\hat{\beta})_{MC}$)  (.076)      (.077)      (.070)      (.074)
$\hat{\sigma}_A$           0.685       0.670       0.274       0.275



Table 3.13: Estimated bias of parameter estimates for the shifted threshold model
using data simulated from the varying threshold model with similar variation differences
in the thresholds. $SE(\hat{\beta})_O$ and $SE(\hat{\beta})_{MC}$ denote the standard error of $\hat{\beta}$ computed
from the observed information matrix and from the Monte Carlo estimates across
all simulations. The estimate $\hat{\sigma}_A$ is the average standard deviation estimate of the
shifted thresholds.

THRESHOLD COVARIANCE STRUCTURE $(\sigma_1, \rho, \sigma_2)$

                          $(.71, \rho, .84)$   $(.71, \rho, .84)$

                          $\rho = 0$
$\alpha_1$                0.031
$\alpha_2$                -0.087
$\beta$                   -0.026
$(SE(\hat{\beta})_O)$     (.073)
$(SE(\hat{\beta})_{MC})$  (.074)
$\hat{\sigma}_A$          0.557

                          $\rho = .5$          $\rho = -.5$
$\alpha_1$                0.006                0.054
$\alpha_2$                -0.059               -1.113
$\beta$                   -0.003               -0.027
$(SE(\hat{\beta})_O)$     (.073)               (.075)
$(SE(\hat{\beta})_{MC})$  (.074)               (.074)
$\hat{\sigma}_A$          0.669                0.411

                          $\rho = .8$          $\rho = -.8$
$\alpha_1$                -0.022               0.051
$\alpha_2$                -0.038               -0.127
$\beta$                   0.003                -0.034
$(SE(\hat{\beta})_O)$     (.078)               (.074)
$(SE(\hat{\beta})_{MC})$  (.080)               (.071)
$\hat{\sigma}_A$          0.736                0.326



Table 3.14: Estimated bias of parameter estimates for the shifted threshold model
using data simulated from the shifted threshold model. $SE(\hat{\beta})_O$ and $SE(\hat{\beta})_{MC}$ denote
the standard error of $\hat{\beta}$ computed from the observed information matrix and from
the Monte Carlo estimates across all simulations. The estimate $\hat{\sigma}_A$ is the average
standard deviation estimate of the shifted thresholds.

SHIFTED THRESHOLD STANDARD DEVIATION

                          $\sigma = .4$   $\sigma = .71$   $\sigma = 1$
$\alpha_1$                -0.009          -0.018           -0.019
$\alpha_2$                -0.005          -0.007           -0.013
$\beta$                   -0.002          -0.002           -0.002
$(SE(\hat{\beta})_O)$     (.076)          (.080)           (.081)
$(SE(\hat{\beta})_{MC})$  (.077)          (.083)           (.080)
$\hat{\sigma}_A$          0.394           0.706            1.000

the estimated bias for the regression coefficient $\beta$ was only (in absolute value) 0.003,
or a percent bias of less than one percent. The largest bias in the regression coefficient
(-0.034) was found with $\rho = -0.8$. It is also interesting to note that the estimated
standard deviation for the random effect increased from 0.326 with $\rho = -0.8$ to 0.736
with $\rho = 0.8$. The latter value is near the average (0.775) of the two standard
deviation values assigned to the varying thresholds.

Table 3.14 contains the results when the data truly come from the shifted threshold
model. As one would expect, the shifted threshold model fit the true data accurately.
In  this  ideal  situation,  the  absolute  estimated  bias  of  the  regression  parameter  for  all 
three  shifted  threshold  variabilities  was  less  than  0.01,  which  can  be  explained  by  the 
Monte  Carlo  error.  Note  that  these  estimated  biases  are  very  similar  to  those  found 
in  Table  3.13  for  the  similar  threshold  variances  with  positive  correlation.  However, 
the  estimated  bias  for  the  threshold  parameters  is  much  less  under  the  true  model. 
The  estimated  standard  deviations  of  the  thresholds  are  very  close  to  the  true  values. 
In  addition,  the  standard  errors  from  the  observed  information  matrix  and  the  Monte 
Carlo  estimate  are  very  similar. 




From these simulations, we see that the bias in the regression parameter, the
parameter of interest, decreases as the correlation between the threshold random
effects increases and the variance components for the thresholds become more similar.
In fact, the estimated bias of the regression parameter with similar threshold variances
and positive correlation is similar to that obtained when the data are simulated under
the true model. The reason for this is made clear by considering the following. In
simulating the values of the random effects for the varying threshold model, we used
a bivariate normal distribution with covariance matrix


$$\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. \tag{3.99}$$

The covariance matrix for the shifted threshold model can be viewed in the same
form as (3.99), but with $\rho = 1$ and $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Thus as $\rho$ approaches positive
1.0, and the variance components for the varying threshold become more similar, the
shifted threshold model should perform well.
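
As a rough illustration of this limiting argument, the following sketch draws pairs of threshold random effects from a bivariate normal distribution with covariance matrix (3.99), using a Cholesky-type construction. The function name `draw_threshold_effects`, the seed, and the sample sizes are illustrative assumptions; this is not the simulation code used to produce the tables.

```python
import math
import random

def draw_threshold_effects(sigma1, sigma2, rho, rng):
    """Draw (u1, u2) from a bivariate normal with covariance matrix (3.99)."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    u1 = sigma1 * z1
    # Cholesky factor of (3.99): Var(u2) = sigma2^2, Cov(u1, u2) = rho*sigma1*sigma2
    u2 = sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return u1, u2

rng = random.Random(1999)  # arbitrary seed, for reproducibility only

# rho = 1 and sigma1 = sigma2: both thresholds move by the same amount,
# which is the shifted threshold model.
shifted_like = [draw_threshold_effects(0.71, 0.71, 1.0, rng) for _ in range(5)]

# rho = -0.8 with unequal standard deviations: thresholds tend to move apart.
varying = [draw_threshold_effects(0.4, 1.0, -0.8, rng) for _ in range(2000)]
```

With $\rho = 1$ and equal standard deviations the two components of each draw coincide, so the varying threshold model collapses to a common shift.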

It is difficult to say with certainty how thresholds vary in “real life” datasets when
the data are collected longitudinally or in clusters. Recall that in the latent variable
motivation for the cumulative logit model (Section 3.2.2), the thresholds partitioned
the underlying continuous response $Y^*$ into a coarser, categorized version, $Y$. That
is,

$$Y = r \quad \text{if} \quad \alpha_{r-1} < Y^* \le \alpha_r, \qquad r = 1, \ldots, R,$$

with $\alpha_0 = -\infty$ and $\alpha_R = \infty$. The shifted threshold model moves the “windows”
(distance between thresholds) for each response along the underlying continuous response
scale, but keeps them the same size. The varying threshold model allows the
“windows” to move and change size. This would allow, for example, some subjects to
have wider “windows” for particular responses, and smaller for others. We feel that if
the thresholds did vary individually, they would probably move in similar directions




(i.e., be positively correlated), in which case the shifted threshold model would provide
reasonable estimates.
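
The distinction between moving and resizing the response "windows" can be made concrete with a small sketch. The threshold values, the subject-specific effects, and all function names below are hypothetical and chosen only for illustration; the categorization uses the convention that the upper threshold is included in the window.

```python
import bisect

def categorize(y_star, thresholds):
    """Return r such that alpha_{r-1} < y_star <= alpha_r (alpha_0 = -inf, alpha_R = +inf)."""
    return bisect.bisect_left(thresholds, y_star) + 1

def shift_thresholds(thresholds, u):
    """Shifted threshold model: every threshold moves by the same amount u."""
    return [a + u for a in thresholds]

def vary_thresholds(thresholds, us):
    """Varying threshold model: each threshold moves by its own amount."""
    return [a + u for a, u in zip(thresholds, us)]

def window_widths(thresholds):
    """Widths of the interior windows (distances between adjacent thresholds)."""
    return [b - a for a, b in zip(thresholds, thresholds[1:])]

base = [-1.0, 0.5, 2.0]                        # hypothetical thresholds, R = 4 categories
shifted = shift_thresholds(base, 0.7)          # one random effect per subject
varied = vary_thresholds(base, [0.3, -0.2, 0.4])  # one random effect per threshold

print(window_widths(base))     # widths under the population thresholds
print(window_widths(shifted))  # same widths: windows moved but kept their size
print(window_widths(varied))   # different widths: windows moved and changed size
print(categorize(0.6, base), categorize(0.6, shifted))  # 3 and 2
```

Note that in the varying case the subject-specific effects must leave the thresholds in increasing order for the categorization to remain well defined.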

An  open  question  is  how  the  standard  errors  relate  between  the  two  models. 
We  would  expect  that  the  standard  error  estimates  from  the  two  models  would  be 
similar,  especially  as  the  variance  components  for  the  thresholds  become  similar  and 
the  correlation  nears  1.0.  From  the  results  of  the  wine  tasting  dataset  it  was  seen 
that  the  standard  errors  between  the  more  complex  model  and  the  simpler  model  were 
quite  similar.  Ideally,  one  would  like  to  account  for  all  sources  of  extraneous  variability 
when  modeling.  However,  when  doing  so  produces  a much  more  complicated  model 
for  both  fitting  and  interpretation,  use  of  a simpler  model  that  attempts  to  account 
for  most  of  the  heterogeneity  is  probably  more  appropriate. 


CHAPTER  4 

NONPARAMETRIC  MAXIMUM  LIKELIHOOD  ESTIMATION  IN 
MULTIVARIATE  GENERALIZED  LINEAR  MIXED  MODELS 

4.1  Introduction 

In  the  previous  chapter,  we  considered  random  effects  models  for  nominal  and 
ordinal  response  data  where  we  assumed  that  the  random  effects  followed  a multi- 
variate normal  distribution.  The  multivariate  normal  assumption  allows  for  a variety 
of  covariance  structures  for  the  modeling  of  the  random  effects.  Another  advantage  of 
such  an  assumption  is  that  the  random  effects  distribution  can  be  estimated  more  ac- 
curately when  the  cluster  sizes  are  small.  The  motivation  of  generalized  linear  mixed 
models  as  extensions  of  linear  mixed  models  has  also  contributed  to  the  popularity 
of  the  normality  assumption.  In  spite  of  its  popularity  and  attractive  features,  the 
assumption  of  normality  can  rarely  be  verified.  Thus  a concern  of  making  such  an 
assumption  is  the  possible  misspecification  of  the  random  effects  distribution. 

There  have  been  a number  of  studies  that  have  investigated  the  effects  of  mis- 
specification of  the  random  effects  distribution,  one  of  which  was  the  seminal  paper 
by  Heckman  and  Singer  (1984).  Examining  models  for  censored  longitudinal  eco- 
nomic data,  Heckman  and  Singer  (1984)  showed  that  the  fixed  parameter  estimates 
in  a particular  Weibull  regression  model  were  highly  sensitive  to  the  parametric  as- 
sumption about  the  random  effects  distribution.  Though  the  severe  changes  reported 
in  the  paper  were  specific  to  the  model  and  data  combination,  the  results  provided 
evidence  that  misspecification  could  impact  the  estimates,  and  ultimately  the  infer- 
ence in  a model.  Others  have  shown  supportive,  although  less  dramatic  evidence 
for  other  types  of  models.  Davies  (1987)  performed  a small  study  in  which  he  ana- 
lyzed simulated  residential  mobility  data  using  a probit  model.  For  each  household, 




a response  profile  consisting  of  zeros  and  ones  was  simulated  according  to  whether 
or  not  the  household  moved  in  that  year.  The  probability  of  moving  was  dependent 
on  five  household  specific  covariates  whose  parameter  values  were  determined  from  a 
previous  study  of  American  households.  Davies  (1987)  simulated  the  data  using  both 
a normal  mixing  distribution  and  a triangular  mixing  distribution.  To  maximize  the 
log-likelihood,  Davies  (1987)  assumed  that  the  mixing  distribution  was  a continuous 
rectangular  distribution  and  used  Gauss-Legendre  quadrature  to  integrate  over  the 
mixing  distribution.  One  reason  for  choosing  the  rectangular  distribution  was  that 
the  numerical  integration  was  restricted  to  a finite  range.  Davies  (1987)  concluded 
that  the  discrepancies  in  the  parameter  estimates  are  less  extreme  if  the  true  and 
assumed  mixing  distributions  are  at  least  somewhat  similar.  More  recently,  Neuhaus 
et  al.  (1992)  and  Butler  and  Louis  (1992)  showed  for  the  binary  random  intercept  lo- 
gistic model  that  the  model  parameters  were  indeed  inconsistent  if  the  random  effects 
distribution  was  misspecified.  The  magnitude  of  the  bias,  however,  was  found  to  be 
typically  small.  Neuhaus  et  al.  (1992)  concluded  that  the  assumption  of  normality 
resulted  in  generally  robust  inferences  about  the  regression  parameters  under  mis- 
specification  of  the  random  effects  distribution.  In  light  of  these  conclusions,  there 
has been a considerable amount of recent work focused on nonparametric approaches
to fitting these models that avoid parametric assumptions about the distribution of
the  random  effects  (Heckman  and  Singer  1984;  Davies  1987;  Wood  and  Hinde  1987; 
Follmann  and  Lambert  1989;  Butler  and  Louis  1992;  Wedel  and  DeSarbo  1995;  Aitkin 
1996,  1999). 

In  this  chapter,  we  propose  a class  of  alternative  models  for  repeated  nominal  and 
ordinal  response  data  in  which  the  random  effects  or  mixing  distribution  is  estimated 
nonparametrically.  The  proposed  models,  which  can  be  considered  as  extensions  of 
those  by  Aitkin  (1996,  1999),  are  estimated  by  way  of  maximum  likelihood  estimation, 
providing  nonparametric  maximum  likelihood  (NPML)  estimates  of  both  the  fixed 




parameters  and  the  mixing  distribution.  We  begin  in  Section  4.2  with  the  definition 
of  the  proposed  models.  Within  this  section  we  will  describe  an  EM  algorithm  for 
the  NPML  estimation  of  the  unknown  parameters  and  the  mixing  distribution.  We 
will  discuss  methods  of  inference  as  well  as  provide  a method  for  calculating  standard 
errors.  In  Section  4.3  we  will  provide  sufficient  conditions  under  which  the  proposed 
models  are  identifiable.  We  will  then,  in  Section  4.4,  apply  the  proposed  models  to  the 
wine  tasting  dataset  considered  in  the  previous  chapter.  We  conclude  this  chapter  by 
reporting  the  results  of  two  simulation  studies.  The  first  study  compares  the  models  of 
the  previous  chapter  to  the  proposed  models  under  a variety  of  mixing  distributions. 
In  the  second  study  we  investigate  the  validity  of  the  Wald  and  likelihood-ratio  tests 
within  the  context  of  the  NPML  model  for  making  inference  on  the  fixed  effects. 

4.2  Nonparametric  Maximum  Likelihood  Estimation 

As in Chapter 3, the proposed models will be developed as extensions of multivariate
generalized linear models. The model definition and subsequent estimation
algorithm will be presented for a general link function $g(\pi)$ or response function $h(\eta)$.
Thus the forms of the design matrix $Z$ and parameter vector $\beta$ are assumed to be
appropriate for the desired link function, with the specific forms being found in Chapter
2. In the discussion that follows, only models in which thresholds of the nominal or
ordinal regression model are allowed to vary across subjects will be considered. In
Chapter 5 we will look at extending the models to allow for bivariate random effect
structures.

As before, denote the multinomial response vector for the $j$th observation from
the $i$th cluster as $y_{ij}$ with corresponding multinomial sample size and covariate
vector $x_{ij}$, $j = 1, \ldots, T_i$, $i = 1, \ldots, n$. We assume that conditional on an unobserved
random variable $u_i$, $y_{ij}$ has the multivariate exponential form with response function
$h(\eta_{ij})$ and linear predictor $\eta_{ij} = Z_{ij}\beta + u_i$. In addition we assume that observations
between clusters are independent, and observations within clusters are conditionally




independent. That is, we assume for the complete response vector $y$ and random
effects vector $u$ that

$$f(y \mid u) = \prod_{i=1}^{n} \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i),$$

where $f(y_{ij} \mid \beta; u_i)$ denotes the multinomial distribution. Denoting the mixture
distribution by $G$, the likelihood for the model is given by

$$L(\beta, G) = \prod_{i=1}^{n} \int \prod_{j=1}^{T_i} f(y_{ij} \mid \beta; u_i) \, dG(u_i). \tag{4.1}$$

In Chapter 3 we estimated the parameters $\beta$ and the mixing distribution $G$ in
(4.1) by assuming $G$ was normal and maximizing the marginal likelihood obtained by
numerical integration. To avoid possible misspecification, we now consider estimating
$G$ nonparametrically. Specifically, we assume that $G$ is a discrete distribution function
with unknown finite support size $K$, masses $p = (p_1, \ldots, p_K)$, and mass points
$m = (m_1, \ldots, m_K)$, where $\sum_{k=1}^{K} p_k = 1$ and $p_k > 0$, $k = 1, \ldots, K$. Considering
only the class of discrete distributions with finite support size is not restrictive, since
it has been well established that the NPML estimate of a (possibly continuous) mixing
distribution is concentrated on a finite number of points (Kiefer and Wolfowitz 1972;
Laird 1978; Lindsay 1983a, 1983b). For fixed support size, the resulting likelihood is
given by

$$L(\beta, p, m) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k f(y_i \mid \beta, m_k), \tag{4.2}$$

where $f(y_i \mid \beta, m_k) = \prod_{j=1}^{T_i} f(y_{ij} \mid \beta, m_k)$.
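
To make the structure of (4.2) concrete, the sketch below evaluates its logarithm for a deliberately simplified special case: binary responses, a logit link, one scalar covariate, and a random intercept with $K$ mass points. These simplifications, and the function names, are assumptions made for brevity; the models in this chapter use multinomial responses.

```python
import math

def bernoulli_logit_pmf(y, eta):
    """P(Y = y) for a binary response with logit link and linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p

def cluster_likelihood(ys, xs, beta, m_k):
    """f(y_i | beta, m_k): product over the T_i observations in cluster i."""
    lik = 1.0
    for y, x in zip(ys, xs):
        lik *= bernoulli_logit_pmf(y, m_k + beta * x)
    return lik

def npml_log_likelihood(clusters, beta, masses, points):
    """Log of likelihood (4.2) for fixed support size K = len(points)."""
    total = 0.0
    for ys, xs in clusters:
        # Finite mixture over the K mass points replaces the integral in (4.1)
        mix = sum(p * cluster_likelihood(ys, xs, beta, m)
                  for p, m in zip(masses, points))
        total += math.log(mix)
    return total

# Two hypothetical clusters, each given as (responses, covariate values)
clusters = [([1, 0, 1], [0.0, 1.0, -1.0]), ([0, 0], [0.5, 2.0])]
print(npml_log_likelihood(clusters, 0.3, [0.4, 0.6], [-1.0, 1.0]))
```

Each cluster contributes a finite sum over mass points, so no numerical integration is needed once $K$, $p$, and $m$ are given.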

An  abundance  of  literature  exists  in  the  general  area  of  finite  mixture  modeling, 
with  a thorough  introduction  being  found  in  Titterington  et  al.  (1985).  The  appli- 
cation of  mixture  models  to  the  regression  setting  has  become  increasingly  popular 
as  well.  There  have  been  a number  of  authors  who  have  proposed  special  cases  of 
the  models  defined  by  likelihood  (4.2).  Follmann  and  Lambert  (1989)  considered  a 




binary  logistic  regression  model  in  which  the  intercept  was  assumed  to  vary  according 
to  a discrete  distribution.  The  models  that  they  considered  could  account  for  overdis- 
persion at  the  binomial  level.  Lindsay  et  al.  (1991)  later  related  the  binary  logistic 
regression  mixture  model  to  the  binary  Rasch  model  (Rasch  1961).  Butler  and  Louis 
(1992)  also  considered  a binary  logistic  regression  model  with  a random  intercept, 
but  allowed  for  longitudinal  observations  as  well.  We  noted  in  Chapter  3 that  Adams 
and  Wilson  (1996)  and  Adams  et  al.  (1997)  both  assumed  a discrete  distribution  for 
their  shifted  and  varying  threshold  Rasch  models.  In  their  approach,  however,  they 
assumed  that  the  mass  points  were  known  and  only  estimated  the  masses.  Both 
Wedel  and  DeSarbo  (1995)  and  Aitkin  (1996,  1999)  have  considered  nonparametric 
mixture  extensions  in  generalized  linear  models.  Aitkin  (1996,  1999)  generalized  work 
by  Wood  and  Hinde  (1987)  by  modeling  overdispersion  and  heterogeneity  in  the  class 
of  generalized  linear  models  through  the  use  of  nonparametric  mixtures.  Using  a 
similar  motivation,  Wedel  and  DeSarbo  (1995)  proposed  a general  latent  class  model 
which allowed for a generalized linear model within each class. Both Wedel and DeSarbo
(1995) and Aitkin (1996, 1999) used identical EM algorithms for estimation of
the parameters, though Wedel and DeSarbo (1995) estimated a set of parameters for
each latent class. We now present a similar EM algorithm for maximizing (4.2). Following
the section on estimation, we will discuss in Section 4.2.2 methods for making
inference in the NPML approach.

4.2.1  Estimation 

In likelihood (4.2) we have replaced the intractable normal integrals from the
previous chapter with finite summations. Though this would seem to be a great
simplification, finding the maximum likelihood values for $\beta$, $p$, $m$, and $K$ is still
problematic and often computationally intensive. The greatest difficulty is determining
the location of the mass points $m$. The approaches for maximizing (4.2) can be
generally grouped into those that treat $K$ as fixed and those that estimate $K$ within




the maximization algorithm. The EM algorithm that we propose belongs to the
former group; however, we will first briefly discuss some of the algorithms in the
latter. A complete discussion of both types of algorithms can be found in Böhning
(1995).

Many  of  the  algorithms  that  estimate  K within  the  maximization  procedure  are 
based  on  the  geometric  interpretation  of  mixture  models  as  described  by  Lindsay 
(1983a,  1983b).  Lindsay  (1983a)  showed  that  finding  the  maximum  likelihood  es- 
timate of  a mixture  distribution  was  equivalent  to  maximizing  a concave  function 
over  a convex  set.  Thus  methods  used  for  convex  optimization  can  be  applied  to 
mixture  models  as  well.  In  particular,  directional  derivatives  or  gradient  functions 
can  be  used  as  directional  guides  for  finding  the  maximum.  Such  functions  are  uti- 
lized in  a number  of  vertex  direction  algorithms  such  as  the  vertex  direction  method 
(Lindsay  1983a),  the  vertex  exchange  method  (Lesperance  and  Kalbfleisch  1992),  and 
the  intra  simplex  direction  method  (ISDM)  (Lesperance  and  Kalbfleisch  1992).  Such 
methods  either  sequentially  increase  the  support  size  at  each  iteration  or  adaptively 
add  or  subtract  points  at  each  iteration.  Follmann  and  Lambert  (1989)  utilized  a 
combination  of  directional  derivatives  and  the  EM  algorithm  for  maximizing  mix- 
tures of  binomials.  In  general,  methods  that  estimate  the  support  size  adaptively  are 
complex  to  program.  However,  Lesperance  and  Kalbfleisch  (1992)  have  shown  that 
convergence  for  certain  variants,  such  as  the  ISDM,  can  be  extremely  fast. 

An  alternative  method  for  finding  the  maximum  likelihood  estimate  of  a mixing 
distribution  is  to  treat  K as  fixed.  The  method  is  based  on  a result  mentioned 
previously  that  the  nonparametric  maximum  likelihood  estimate  has  a finite  support 
size.  Using  this  result,  one  can  sequentially  maximize  the  likelihood  for  fixed  K = 
1,  2,  • • • until  convergence  is  obtained.  The  EM  algorithm  is  a popular  algorithm  for 
computing  the  maximum  at  each  fixed  K because  of  its  numerical  simplicity  and 
guaranteed  monotonicity.  However  it  can  be  extremely  slow  and  convergence  to  a 




local  maximum  is  possible.  In  light  of  this,  we  first  outline  the  basic  EM  algorithm 
for  maximizing  (4.2).  We  then  discuss  methods  for  accelerating  the  convergence  of 
the  algorithm.  We  conclude  this  section  with  comments  concerning  the  application 
of  the  algorithm. 

EM  algorithm 

We use the EM algorithm to estimate the regression parameters $\beta$, the mass points
$m$, and the masses $p$. The implementation of the EM algorithm relies on the complete
log-likelihood of both the observed and unobserved data. For the given models, the
complete log-likelihood has the form

$$\log L(\beta, G) = \sum_{i=1}^{n} \log f(y_i \mid \beta, u_i) + \sum_{i=1}^{n} \log G(u_i), \tag{4.3}$$

where $G(u_i)$ is the $K$ component discrete distribution function for $u_i$ with mass points
$m$ and masses $p$. The complete log-likelihood is maximized using an iterative EM
algorithm. In the E-step the complete log-likelihood is replaced by its expectation,
calculated on the basis of provisional estimates of $\beta$ and $G$. This expectation is then
maximized in the M-step with respect to $\beta$ and $G$ to obtain new provisional estimates.
These two steps are then alternated until no further improvement in the likelihood
occurs. The details for the E-step and M-step now follow.

In the E-step at the $(s+1)$th iteration, the expectation of the complete log-likelihood
(4.3) is calculated with respect to the conditional distribution
$f(u \mid y, \beta^{(s)}, m^{(s)}, p^{(s)})$, where $\beta^{(s)}$, $m^{(s)}$, and $p^{(s)}$ are the estimated parameters from
the previous iteration. Using independence, Bayes Rule, and expressing $G(u_i)$ in




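terms of the masses $p$ and mass points $m$, one obtains the expectation

$$
\begin{aligned}
E[\log L(\beta, m, p \mid \beta^{(s)}, m^{(s)}, p^{(s)})]
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \left[\log f(y_i \mid m_k, \beta) + \log p_k\right]
   \frac{p_k^{(s)} f(y_i \mid m_k^{(s)}, \beta^{(s)})}{\sum_{l=1}^{K} p_l^{(s)} f(y_i \mid m_l^{(s)}, \beta^{(s)})} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \left[a_{ik}^{(s)} \log f(y_i \mid m_k, \beta) + a_{ik}^{(s)} \log p_k\right],
\end{aligned} \tag{4.4}
$$

where

$$a_{ik}^{(s)} = \frac{p_k^{(s)} f(y_i \mid m_k^{(s)}, \beta^{(s)})}{\sum_{l=1}^{K} p_l^{(s)} f(y_i \mid m_l^{(s)}, \beta^{(s)})}.$$

The $\{a_{ik}^{(s)}\}$ can be interpreted as the estimated posterior probability that the response
vector $(y_{i1}, \ldots, y_{iT_i})$ for subject $i$ comes from component $k$. Also note that the $\{a_{ik}^{(s)}\}$
are known constants depending only on the parameter estimates from the previous
iteration.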

The M-step consists of maximizing (4.4), the expectation of the complete log-likelihood,
with respect to $\beta$, $m$, and $p$. The second term on the right of (4.4)
is not a function of $\beta$ or $m$ and can be maximized separately from the first term.
Maximizing $\sum_{i} \sum_{k} a_{ik}^{(s)} \log p_k$ subject to $\sum_{k=1}^{K} p_k = 1$ yields simply

$$p_k^{(s+1)} = \sum_{i=1}^{n} a_{ik}^{(s)} / n.$$
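
The posterior probabilities from the E-step and the closed-form update for the masses can be sketched in a few lines. The sketch assumes the component likelihoods $f(y_i \mid m_k, \beta)$ have already been evaluated at the current estimates and stored in a nested list `f`; the function names and the numbers in the example are hypothetical.

```python
def e_step(f, masses):
    """Posterior probabilities a_ik from component likelihoods f[i][k] and current masses."""
    post = []
    for row in f:
        joint = [p * fik for p, fik in zip(masses, row)]  # p_k * f(y_i | m_k, beta)
        denom = sum(joint)                                # mixture likelihood for cluster i
        post.append([j / denom for j in joint])
    return post

def update_masses(post):
    """M-step for the masses: p_k = sum_i a_ik / n."""
    n = len(post)
    K = len(post[0])
    return [sum(row[k] for row in post) / n for k in range(K)]

# Hypothetical component likelihoods for n = 3 clusters and K = 2 mass points
f = [[0.20, 0.05],
     [0.01, 0.30],
     [0.10, 0.10]]
post = e_step(f, [0.5, 0.5])
new_masses = update_masses(post)
print(post)
print(new_masses)
```

Each row of `post` sums to one, and the updated masses are the column averages of the posterior probabilities, so they also sum to one.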

Since the $\{a_{ik}^{(s)}\}$ are known, the first term on the right of (4.4),

$$\sum_{i=1}^{n} \sum_{k=1}^{K} a_{ik}^{(s)} \log f(y_i \mid \beta, m_k) = \sum_{i=1}^{n} \sum_{j=1}^{T_i} \sum_{k=1}^{K} a_{ik}^{(s)} \log f(y_{ij} \mid \beta, m_k),$$

can be recognized as the log-likelihood of a weighted multivariate generalized linear
model with known weights $\{a_{ik}^{(s)}\}$. Wood and Hinde (1987) noted that $m_k$, $k = 1, \ldots, K$,




can be easily estimated by incorporating a $K$ level factor in the model in
place of $m_k$. Let $d_{ijk} = (d_{ijk1}, \ldots, d_{ijk,K-1})$ be the corresponding vector of $K - 1$
dummy variables for the $K$ level factor where

$$d_{ijkl} = \begin{cases} 1 & \text{if } l = k, \\ 0 & \text{otherwise,} \end{cases} \qquad l = 1, \ldots, K - 1,$$

and let $\mathcal{D}_{ijk} = 1_q \otimes d_{ijk}$. The linear predictor can now be re-written as $\eta_{ijk} = Z^*_{ijk}\beta^*$,
where $\beta^* = (\beta, m_1, \ldots, m_{K-1})$ and $Z^*_{ijk} = [Z_{ij} \mid \mathcal{D}_{ijk}]$. This new representation also
requires the replication of the response vectors such that $y^*_{ijk} = y_{ij}$, $k = 1, \ldots, K$.
Thus the first term on the right of (4.4) is a weighted multivariate generalized linear
model with form

$$E(y^*_{ijk}) = h(Z^*_{ijk}\beta^*)$$

and known weights $\{a_{ik}^{(s)}\}$.

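The Fisher scoring algorithm is used to find the MLE of $\beta^*$. The forms of the
score function and expected information matrix for weighted multivariate generalized
linear models are known (Fahrmeir and Tutz 1994) and are given below. The score
function for the algorithm is given by

$$S(\beta^* \mid \beta^{*(s)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{j=1}^{T_i} a_{ik}^{(s)} Z^{*\prime}_{ijk} D_{ijk} \Sigma_{ijk}^{-1} (y^*_{ijk} - \mu_{ijk}),$$

where $D_{ijk} = \partial h(Z^*_{ijk}\beta^*) / \partial \eta$ and $\Sigma_{ijk} = \mathrm{diag}(\mu_{ijk}) - \mu_{ijk}\mu_{ijk}'$. The expected
information matrix for the weighted multivariate generalized linear model is

$$F(\beta^* \mid \beta^{*(s)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{j=1}^{T_i} a_{ik}^{(s)} Z^{*\prime}_{ijk} D_{ijk} \Sigma_{ijk}^{-1} D_{ijk}' Z^*_{ijk}.$$

Thus within the M-step of the EM algorithm, the Fisher scoring algorithm is

$$\beta^*_{p+1} = \beta^*_p + F^{-1}(\beta^*_p \mid \beta^{*(s)}) \, S(\beta^*_p \mid \beta^{*(s)}),$$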




given starting value $\beta^*_0$.

The NPML version of the EM algorithm with a Fisher scoring algorithm embedded
in each M-step can be summarized as follows:

0. Calculate initial values $\beta^{*(0)}$ and $p^{(0)}$.

For $s = 0, 1, 2, \ldots$

1. Calculate posterior probabilities $a_{ik}^{(s)}$, $i = 1, \ldots, n$, $k = 1, \ldots, K$, using $\beta^{*(s)}$ and
$p^{(s)}$. Calculate $p^{(s+1)}$ using $a_{ik}^{(s)}$, $i = 1, \ldots, n$, $k = 1, \ldots, K$.

2. Carry out the Fisher scoring algorithm to obtain $\beta^{*(s+1)}$ using the weights $a_{ik}^{(s)}$,
$i = 1, \ldots, n$, $k = 1, \ldots, K$.

The algorithm is defined in terms of a fixed support size $K$. To determine $K$, the algorithm
is successively applied while incrementing $K$ until convergence in parameter
estimates and the likelihood value is obtained. We will discuss the issue of convergence
as well as how one obtains starting values following the next section on EM
acceleration.
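
The steps above can be sketched for a much simpler setting than the multinomial models of this chapter. The code below is a minimal sketch under several assumptions that are not part of the original algorithm description: responses are binary with a logit link, there is one scalar covariate and no separate intercept, the factor is coded with $K$ dummy variables (one per mass point) rather than $K - 1$, and a fixed number of Fisher scoring steps is taken in each M-step instead of iterating to convergence. All names, starting values, and simulation settings are hypothetical.

```python
import math
import random

def sigmoid(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def cluster_lik(ys, xs, beta, m):
    """f(y_i | beta, m_k) for binary responses with eta = m_k + beta * x."""
    lik = 1.0
    for y, x in zip(ys, xs):
        pi = sigmoid(m + beta * x)
        lik *= pi if y == 1 else 1.0 - pi
    return lik

def log_lik(clusters, beta, points, masses):
    """Observed-data log-likelihood, the log of (4.2)."""
    return sum(math.log(sum(p * cluster_lik(ys, xs, beta, m)
                            for p, m in zip(masses, points)))
               for ys, xs in clusters)

def npml_em(clusters, beta, points, masses, n_iter=50, scoring_steps=5):
    """EM for fixed K with Fisher scoring embedded in each M-step."""
    K, n = len(points), len(clusters)
    for _s in range(n_iter):
        # Step 1: posterior probabilities a_ik, then the new masses
        post = []
        for ys, xs in clusters:
            joint = [p * cluster_lik(ys, xs, beta, m) for p, m in zip(masses, points)]
            total = sum(joint)
            post.append([j / total for j in joint])
        masses = [sum(a[k] for a in post) / n for k in range(K)]
        # Step 2: Fisher scoring for (beta, m_1, ..., m_K) with weights a_ik
        for _t in range(scoring_steps):
            score = [0.0] * (K + 1)
            info = [[0.0] * (K + 1) for _ in range(K + 1)]
            for (ys, xs), a in zip(clusters, post):
                for k in range(K):
                    for y, x in zip(ys, xs):
                        pi = sigmoid(points[k] + beta * x)
                        w = a[k] * pi * (1.0 - pi)
                        score[0] += a[k] * (y - pi) * x
                        score[1 + k] += a[k] * (y - pi)
                        info[0][0] += w * x * x
                        info[0][1 + k] += w * x
                        info[1 + k][0] += w * x
                        info[1 + k][1 + k] += w
            step = solve(info, score)
            beta += step[0]
            points = [m + d for m, d in zip(points, step[1:])]
    return beta, points, masses

# Simulated example (all settings hypothetical): two mass points, beta = 1.
rng = random.Random(0)
clusters = []
for _ in range(200):
    u = -1.0 if rng.random() < 0.4 else 1.0
    xs = [rng.uniform(-1.5, 1.5) for _ in range(8)]
    ys = [1 if rng.random() < sigmoid(u + x) else 0 for x in xs]
    clusters.append((ys, xs))

start = (0.0, [-0.5, 0.5], [0.5, 0.5])
fit = npml_em(clusters, *start)
print(log_lik(clusters, *start), log_lik(clusters, *fit))
```

Looping over every cluster and mass point combination inside the scoring step plays the role of the replicated responses $y^*_{ijk}$ with weights $a_{ik}^{(s)}$. To choose $K$, one would rerun `npml_em` with an increasing number of mass points and compare the resulting log-likelihoods.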

Acceleration  of  the  EM  algorithm 

The  EM  algorithm  is  a powerful  tool  for  maximizing  likelihoods  with  unobserved 
data  and  has  many  attractive  characteristics  such  as  guaranteed  monotonicity  of  the 
log-likelihood  and  simplicity  of  implementation.  One  major  drawback,  however,  is  its 
speed  of  convergence  which  depends  on  the  relative  size  of  the  unobserved  information 
on  the  unknown  parameters.  In  terms  of  mixtures,  the  rate  of  convergence  will  depend 
on  the  amount  of  information  about  the  mixing  distribution  that  is  available  from 
the  observed  data  alone.  The  rate  of  convergence  is  also  adversely  affected  when  the 
unknown  parameters  are  near  the  boundary  of  the  parameter  space.  This  can  occur, 
for  example,  if  one  of  the  mass  points  is  located  at  plus  or  minus  infinity.  There  have 
been  many  suggestions  for  speeding  up  the  convergence  of  the  EM  algorithm.  We 
discuss  two  of  these  methods  which  can  be  used  to  accelerate  the  NPML  algorithm. 
In  the  first  method,  the  M-step  of  the  EM  algorithm  is  modified  while  maintaining 




the overall structure of the algorithm. In the second method, the EM algorithm is
initially used and then replaced with a faster converging algorithm. Though discussed
as two separate methods, one can easily combine the methods for even faster
convergence.

The first method that we will discuss, called the Expectation Conditional Maximization
Either (ECME) algorithm (Liu and Rubin 1994), is an extension of the Expectation
Conditional Maximization (ECM) algorithm (Meng and Rubin 1993) which
itself is a generalization of the EM algorithm. To motivate the ECME algorithm we
begin by outlining the EM algorithm and its extension to the ECM algorithm. Following
Meng and Rubin (1993), let $Y = (Y_{obs}, Y_{mis})$ be the complete data, made up of the
observed and missing data, with density $f(Y \mid \theta)$ where $\theta \in \Theta$ is an unknown parameter
vector. Also let $h(Y_{obs} \mid \theta)$ denote the density of the observed data and $k(Y \mid Y_{obs}, \theta)$ the
conditional density of $Y$ given the observed data. Note that

$$h(Y_{obs} \mid \theta) \propto \int f(Y \mid \theta) \, dY_{mis}.$$

Our goal is to find the maximum likelihood estimate, $\hat{\theta}$, of $\theta$ which maximizes the
observed data log-likelihood, which we denote as

$$L(\theta) = \log h(Y_{obs} \mid \theta) = Q(\theta \mid \theta') - H(\theta \mid \theta'), \tag{4.5}$$

where

$$Q(\theta \mid \theta') = E\{\log f(Y \mid \theta) \mid Y_{obs}, \theta'\}$$

is the expected complete data log-likelihood, and

$$H(\theta \mid \theta') = E\{\log k(Y \mid Y_{obs}, \theta) \mid Y_{obs}, \theta'\}$$

is the expected missing data log-likelihood. Starting with an initial value $\theta^{(0)}$, the EM
algorithm maximizes (4.5) by iteratively maximizing $Q(\theta \mid \theta')$ over $\theta$. That is, given




the estimate $\theta^{(s)}$ at the $s$th iteration, $\theta^{(s+1)}$ is found by first computing $Q(\theta \mid \theta^{(s)})$
as a function of $\theta$, and then finding $\theta^{(s+1)}$ that maximizes $Q(\theta \mid \theta^{(s)})$. Iteration
continues between the expectation step (E-step) and the maximization step (M-step)
until convergence.

For some applications of the EM algorithm, the M-step cannot be computed in
closed form. For situations such as this, Meng and Rubin (1993) proposed the ECM
algorithm, which replaces the maximization of $Q(\theta \mid \theta^{(s)})$ with several, hopefully
simpler, conditional maximizations. In the ECM algorithm, the M-step is replaced
by $T$ constrained or conditional maximization (CM) steps, $t = 1, \ldots, T$, where in each
some subset of the parameters is fixed. The CM-steps may or may not have closed
forms, but since they are maximized over smaller dimensions, they are often simpler
and require less time to maximize when iteration is required. Liu and Rubin (1994)
proposed a generalization of the ECM algorithm called the ECM Either algorithm.
In the ECME algorithm, certain CM-steps are modified such that the maximization
is carried out on the constrained observed data likelihood instead of the constrained
expected complete data likelihood. Hence the name ECM Either arises from alternating
between either the complete or observed data likelihoods within the CM-steps. Liu
and Sun (1997) considered the implementation of the ECME algorithm to accelerate
the EM algorithm for mixture distributions. We now modify their approach to
accelerate the NPML EM algorithm discussed in the previous section.

The ECME version of the NPML EM algorithm consists of a single E-step and
two CM-steps. The single E-step is the same as that for the original EM algorithm,
in which we calculate the expectation defined in (4.4). The M-step, however, is
separated into two CM-steps. In the first CM-step we update $\beta^*$ by maximizing the
expected conditional complete data likelihood. Hence the first CM-step corresponds
to the same maximization over $\beta^*$ that was performed in the original M-step. In the
second CM-step we update $p$, this time by maximizing the observed data likelihood


143 


with  (3*  fixed  at  its  current  estimate.  Thus,  instead  of  using 

Pk]  = ±<41 /n. 

1=1 

to  update  p,  we  update  p by  maximizing  the  constrained  observed  log-likelihood 

n K 

L0bs{p)  = i°g  Pk  I &*)■  (4.6) 

i= 1 fc=l 

There are many optimization methods that could be used to maximize (4.6). Since
the first and second derivatives of $L_{obs}$ with respect to $p$ are relatively
simple to calculate, we used a Newton-Raphson algorithm for maximizing (4.6). Given
a current estimate $p^{(o)}$, a new estimate is obtained as follows:

\[ p^{(new)} = p^{(o)} - \delta H^{-1} g, \qquad (4.7) \]

where $0 < \delta \le 1$ and $g$ and $H$ denote the gradient vector and Hessian
matrix, respectively. Let $f_{ik} = \prod_{j=1}^{T_i} f(y_{ijk} \mid \beta^*)$,
where the product is over the $T_i$ observations in the $i$th cluster with the
$k$th dummy variable corresponding to mass point $k$ equal to one. For the observed
log-likelihood (4.6), the gradient vector and Hessian matrix have the following
$s$th and $(r, s)$th elements:

\[ g_s = \frac{\partial L_{obs}(p_1, \ldots, p_{K-1})}{\partial p_s}
       = \sum_{i=1}^{n} \frac{f_{is} - f_{iK}}{\sum_{k=1}^{K} p_k f_{ik}}, \]

\[ H_{rs} = \frac{\partial^2 L_{obs}(p_1, \ldots, p_{K-1})}{\partial p_r \, \partial p_s}
          = -\sum_{i=1}^{n} \frac{(f_{ir} - f_{iK})(f_{is} - f_{iK})}{\left( \sum_{k=1}^{K} p_k f_{ik} \right)^2}, \]

for $1 \le r, s \le K - 1$. Since the parameters $p$ in (4.6) are constrained to be
between zero and one, the Newton-Raphson algorithm (4.7) may overshoot the feasible
region of allowable parameter points. The step scaling factor $\delta$ in (4.7) can
be decreased from the full step ($\delta = 1$) to partial steps ($\delta < 1$) when
such violations occur. In our
experience,  the  ECME  algorithm  by  itself  can  speed  convergence  dramatically  over 
the  basic  EM  algorithm.  This  is  especially  true  when  the  mass  points  are  not  well 
separated. 
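
The second CM-step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the results reported here: the matrix `f` of cluster likelihoods $f_{ik}$ is taken as given (that is, $\beta^*$ is held fixed), the helper names `solve`, `feasible`, and `newton_masses` are hypothetical, and the step is simply halved whenever the Newton-Raphson update (4.7) leaves the feasible region. In the toy data each cluster has nonzero likelihood under exactly one mass point, so the maximizer is known to be the vector of cluster proportions (0.5, 0.3, 0.2).

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def feasible(p_free):
    """All K masses strictly positive, with p_K = 1 - (sum of the free masses)."""
    return all(q > 0.0 for q in p_free) and sum(p_free) < 1.0

def newton_masses(f, p_free, tol=1e-10, max_iter=100):
    """Maximize sum_i log(sum_k p_k f[i][k]) over the free masses p_1..p_{K-1}."""
    K = len(f[0])
    p_free = list(p_free)
    for _ in range(max_iter):
        p = p_free + [1.0 - sum(p_free)]
        g = [0.0] * (K - 1)
        H = [[0.0] * (K - 1) for _ in range(K - 1)]
        for row in f:
            denom = sum(pk * fk for pk, fk in zip(p, row))
            d = [(row[s] - row[K - 1]) / denom for s in range(K - 1)]
            for s in range(K - 1):
                g[s] += d[s]                 # gradient element g_s
                for r in range(K - 1):
                    H[r][s] -= d[r] * d[s]   # Hessian element H_rs
        direction = solve(H, g)              # H^{-1} g
        delta = 1.0                          # start with the full step
        cand = [q - delta * s for q, s in zip(p_free, direction)]
        while not feasible(cand):            # halve the step until feasible
            delta /= 2.0
            cand = [q - delta * s for q, s in zip(p_free, direction)]
        change = max(abs(c - q) for c, q in zip(cand, p_free))
        p_free = cand
        if change < tol:
            break
    return p_free + [1.0 - sum(p_free)]

# Toy data: 5, 3 and 2 clusters supported by mass points 1, 2 and 3.
f = [[1.0, 0.0, 0.0]] * 5 + [[0.0, 1.0, 0.0]] * 3 + [[0.0, 0.0, 1.0]] * 2
p_hat = newton_masses(f, [1.0 / 3.0, 1.0 / 3.0])
```

Halving is one simple way to choose $\delta$; any rule that keeps the iterate strictly inside the simplex would serve the same purpose.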

In  a comprehensive  summary  of  the  EM  algorithm  within  the  context  of  mixtures, 
Redner  and  Walker  (1984)  noted  that  further  research  was  needed  in  developing 
algorithms "...which first take advantage of the good global convergence properties of
the  EM  algorithm  by  using  it  initially  and  then  exploits  the  rapid  local  convergence 
of  Newton’s  method  or  one  of  its  variants  by  switching  to  such  a method  later.” 
Indeed,  even  when  the  EM  algorithm  is  very  near  the  final  estimates,  convergence 
can  be  extremely  slow  due  to  its  linear  convergence  rate.  Thus,  the  second  approach 
we  mention  for  speeding  the  convergence  rate  of  the  mixture  EM  algorithm  starts 
with  the  EM  algorithm  but  then  switches  to  a faster  converging  algorithm.  Such 
an  approach  has  been  utilized  by  Follmann  and  Lambert  (1989)  for  fitting  mixtures 
of  logistic  regression  models,  and  by  Aitkin  and  Aitkin  (1995)  within  the  context  of 
mixtures  of  normals.  We  also  use  this  approach  for  the  simulation  studies  in  Section 
4.5. 

Any  number  of  algorithms  can  be  used  in  conjunction  with  the  EM  algorithm. 
We utilize the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm which was
described in Section 3.3.1. Recall that it has the same structure as (4.7), but uses
approximations  for  the  gradient  vector  and  Hessian  matrix  so  that  one  does  not  need 
to  program  the  first  and  second  derivatives  of  the  likelihood  function.  There  are  no 
specific  rules  as  to  when  one  should  stop  the  EM  algorithm  and  switch  to  the  BFGS 
algorithm.  Follmann  and  Lambert  (1989)  switched  algorithms  when  two  successive 
parameter  iterates  were  close.  For  our  simulation  studies,  we  found  that  switching 
algorithms  when  the  difference  between  two  successive  log-likelihood  values  was  less 
than  0.05  had  no  adverse  effect  on  where  the  algorithm  converged.  That  is,  the  final 
estimates obtained in the combination algorithm were those that the original EM
algorithm converged to as well. In practice one would try a range of values for
determining when to switch algorithms. This approach can be combined with any of the
EM  variants  as  well.  We  coupled  the  ECME  algorithm  with  the  BFGS  algorithm  for 
the  simulation  studies,  which  resulted  in  tremendous  savings  in  computation  time. 
For  example,  using  a Unix  Ultra  1 workstation  with  1024  MB  of  RAM  and  a 400  MHz 
processor,  the  time  required  to  fit  500  simulated  datasets  was  on  the  order  of  five  days 
for  the  original  EM  algorithm,  while  on  the  order  of  12  hours  for  the  combination 
ECME-BFGS  algorithm. 
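
The switching idea can be illustrated with a deliberately small sketch. It is not the ECME-BFGS code used for the simulations: it assumes a two-point mixture with fixed cluster likelihoods, so that only the mass $p$ is estimated, and it swaps in a scalar Newton-Raphson iteration for BFGS to stay within the Python standard library. The function names and the toy likelihood values are hypothetical; the switching threshold of 0.05 on the change in log-likelihood is the one quoted above.

```python
import math

def mixture(p, f1, f2):
    """Marginal likelihood of one cluster under masses (p, 1 - p)."""
    return p * f1 + (1.0 - p) * f2

def loglik(p, f):
    return sum(math.log(mixture(p, f1, f2)) for f1, f2 in f)

def score(p, f):
    """First derivative of the log-likelihood with respect to p."""
    return sum((f1 - f2) / mixture(p, f1, f2) for f1, f2 in f)

def em_step(p, f):
    """E-step weights for component 1, then the closed-form M-step for p."""
    weights = [p * f1 / mixture(p, f1, f2) for f1, f2 in f]
    return sum(weights) / len(weights)

def newton_step(p, f):
    hessian = -sum(((f1 - f2) / mixture(p, f1, f2)) ** 2 for f1, f2 in f)
    return p - score(p, f) / hessian

def em_then_newton(f, p=0.5, switch_tol=0.05, tol=1e-12, max_iter=200):
    """Run EM until the log-likelihood gain drops below switch_tol, then Newton."""
    em_iters = newton_iters = 0
    old = loglik(p, f)
    while em_iters < max_iter:
        p = em_step(p, f)
        em_iters += 1
        new = loglik(p, f)
        if new - old < switch_tol:
            break
        old = new
    while newton_iters < max_iter:
        # Clamp to keep the iterate strictly inside (0, 1).
        p_new = min(max(newton_step(p, f), 1e-8), 1.0 - 1e-8)
        newton_iters += 1
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p, em_iters, newton_iters

# Hypothetical cluster likelihoods (f_i1, f_i2) under the two mass points.
f = [(0.9, 0.2), (0.8, 0.3), (0.1, 0.7), (0.6, 0.5), (0.2, 0.9), (0.7, 0.1)]
p_hat, em_iters, newton_iters = em_then_newton(f)
```

In line with the advice above, `switch_tol` would be varied in practice to confirm that the final estimate does not depend on when the switch occurs.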

Discussion 

There  are  a number  of  other  factors  that  can  influence  the  convergence  rate  of  the 
algorithm,  or  even  whether  the  algorithm  converges  at  all.  The  initial  starting  values 
for  the  regression  parameter,  mass  points,  and  masses  often  determine  if  the  algorithm 
will  converge.  Initial  estimates  for  the  regression  parameters  can  be  obtained  by 
fitting  a model  that  excludes  the  random  effect.  One  could  also  use  the  maximum 
likelihood  estimates  from  one  of  the  approaches  in  the  previous  chapter,  such  as 
the  pseudo  likelihood  approach  or  full  ML  approach.  There  have  been  a number  of 
suggestions  for  determining  initial  estimates  of  the  masses  and  mass  points.  Wood  and 
Hinde  (1987)  and  Follmann  and  Lambert  (1989)  suggested  starting  values  for  the  mass 
points  based  on  the  histogram  of  the  residuals  from  the  fixed  effects  model.  Aitkin 
(1996,  1999)  suggested  using  the  weights  and  nodes  from  Gauss-Hermite  quadrature 
as  initial  estimates  for  the  masses  and  mass  points,  respectively.  Butler  and  Louis 
(1992) generated initial estimates of the mass points from $\{\mathrm{logit}(v/(K+1)),\ v = 1, \ldots, K\}$.
No  single  method  will  guarantee  convergence  to  the  global  maximum.  It  is  best  to 
try  more  than  one  set  of  starting  values  to  verify  that  the  global  maximum  has  been 
reached.  From  our  own  experience,  we  have  had  success  using  the  starting  values 
proposed  by  Aitkin  (1996,  1999)  and  Butler  and  Louis  (1992). 


When  applying  the  NPML  algorithm,  convergence  must  be  obtained  within  a 
fit  when  the  support  size  K is  fixed,  and  then  between  successive  fits  when  the 
support  size  is  increased.  Convergence  within  a fit  can  be  determined  by  monitoring 
the  change  in  parameter  estimates  and  the  change  in  deviance  (Aitkin  1996,  1999). 
If one is using the alternative ECME-BFGS algorithm, convergence is determined
within  the  quasi-Newton  algorithm  where  changes  in  the  log-likelihood  and  gradients 
of  the  log-likelihood  are  monitored.  When  convergence  is  obtained  within  a fit,  the 
support  size  is  increased  and  the  model  is  refit.  There  are  a number  of  ways  to 
determine  if  the  optimal  support  size  has  been  reached.  Typically  an  increase  in  the 
support  size  beyond  the  optimal  value  leads  to  multiplicities  in  mass  points,  or  masses 
with  zero  probabilities.  In  conjunction  with  these  occurrences,  there  is  usually  little 
to  no  change  in  the  deviance  between  the  successive  fits.  Thus  one  can  determine 
convergence  in  K by  comparing  deviances  between  fits.  Occasionally,  however,  an 
increase  in  the  support  size  beyond  the  optimal  value  will  lead  to  singular  matrices 
within  the  Fisher  Scoring  algorithm  as  mass  points  take  on  identical  values.  The 
deviance  would  then  be  undefined,  but  the  choice  of  K would  be  obvious. 

An  alternative  method  for  finding  the  optimal  K is  by  overfitting  the  support  size 
(Aitkin  1996,  1999).  With  this  approach,  one  successively  reduces  K from  some  large 
starting  point  until  the  true  K is  reached.  We  utilized  this  approach  in  the  simulation 
studies  in  Section  4.5.  It  has  been  our  experience,  and  the  experience  of  others  (Wood 
and  Hinde  1987;  Follmann  and  Lambert  1989;  Aitkin  1996,  1999),  that  the  optimal 
support  size  K is  typically  small,  falling  somewhere  between  two  and  six,  even  when 
the  true  mixing  distribution  is  continuous.  One  notable  exception  was  reported  by 
Davies  and  Pickles  (1987),  who  analyzed  primary  and  secondary  shopping  behavior 
for  275  households  in  England.  Davies  and  Pickles  (1987)  proposed  a model  based  on 
the  inverse  Gaussian  density  and  the  negative  exponential  density,  which  included  two 
nuisance parameters. They assumed that the bivariate distribution of the nuisance
parameters  was  concentrated  at  a finite  set  of  two-dimensional  mass  points.  Using  an 
NPML  approach  to  maximize  the  log-likelihood,  Davies  and  Pickles  (1987)  found  that 
the  NPML  estimate  required  18  mass  points  to  fully  characterize  the  nonparametric, 
bivariate  mixing  distribution.  In  Chapter  5 we  discuss  the  extension  of  the  existing 
algorithm  to  the  bivariate  case. 

As noted by Wood and Hinde (1987) and from our own experience, it is possible
that the maximum likelihood estimate of the mixing distribution has positive
probability at $m = \pm\infty$. Indeed, in their NPML algorithm for binary response
data, Wood and Hinde (1987) included mass parameters at $m = \pm\infty$ by default.
To do so, they assumed that the mixing distribution $G$ was of the form

\[ \omega_1 \, \delta(m = -\infty) + \omega_2 \, \delta(m = \infty) + \sum_{k=1}^{K} p_k \, \delta(m = m_k), \]

where $\omega_1 + \omega_2 + \sum_{k=1}^{K} p_k = 1$ and $\delta(m = m_k)$ denotes a
mass point at $m = m_k$. The occurrence of mass points at plus or minus infinity
depends on the distribution of the response profiles for the clusters, with clusters
having all the same response contributing to the event. In a dataset of voting
histories where subjects' voting profiles were recorded over time, Wood and Hinde
(1987) interpreted a mass point at $\pm\infty$ as those people who are certain to
vote (or not to vote) for a particular party. As inclusion of such parameters in the
model would remove the model from the generalized linear model framework, we did not
consider such additions. Our own experience of mass points at plus or minus infinity
occurred only in simulated datasets where they, in those instances, had no adverse
effect on the estimates of the fixed regression parameters. Standard errors,
however, were greatly influenced as they are obtained through inversion of the
observed information matrix. Having an extremely large (in absolute value) mass
point at the least caused the variance estimate to be negative for that mass point.
When this occurred, surprisingly, the standard errors of the fixed regression
parameters were still reasonable. Other times, however, the observed information
matrix became noninvertible, leaving no estimates of the standard errors for the
parameters. If this occurred in the analysis of an actual dataset, one might wish to
fit a model such as that used by Wood and Hinde (1987), which can account for mass
points at plus or minus infinity.

A final  observation  that  we  have  made  concerns  the  incorporation  of  the  mass 
points  into  the  fixed  effects  design  matrix.  Recall  that  this  was  accomplished  by 
expressing  the  K mass  points  as  a set  of  K — 1 dummy  variables  along  with  the  fixed 
covariates.  An  alternative  is  to  include  all  K dummy  variables  and  then  fit  a no 
intercept  model.  We  have  found  that  this  approach  often  leads  to  faster  convergence 
rates. One reason for this is that the mass point estimates are able to move about more
freely,  no  longer  being  estimated  as  changes  from  the  intercept/baseline  mass  point. 

4.2.2  Inference 

We  now  consider  inference  within  the  context  of  the  NPML  approach.  Ideally, 
one  would  like  to  have  an  asymptotic  theory  for  the  estimation  of  the  fixed  effects 
parameters  that  would  parallel  the  standard  maximum  likelihood  theory.  This  would 
provide one with a variety of inferential techniques, such as information matrix
calculations of standard errors, as well as the asymptotic justification for the use of these
methods.  Unfortunately  such  asymptotic  theory  is  still  lacking,  which  is  a major 
deficiency  of  the  NPML  approach.  The  difficulty  in  deriving  the  asymptotic  theory 
arises  from  the  unknown  support  size  for  the  mixing  distribution.  When  the  support 
size  is  unknown,  the  dimension  of  the  parameter  vector  is  also  unknown  making  large 
sample results difficult to obtain. Despite this, the majority of the
recommendations for making inference on the fixed effects parameters are based on the standard
maximum  likelihood  theory.  In  this  section  we  consider  the  calculation  of  standard 
error  estimates  for  the  fixed  parameters  and  the  mixing  parameters  in  addition  to 
likelihood  inference  for  the  NPML  approach.  We  obtain  standard  errors  through  the 
calculation of the observed information matrix, with the required second derivatives
given  below.  We  then  consider  hypothesis  testing  for  the  fixed  parameters  and  the 
mixing  distribution  as  well  as  for  model  comparisons. 

Standard  errors 

There  have  been  a number  of  suggestions  for  obtaining  standard  errors  within  the 
context  of  the  NPML  approach  (Follmann  and  Lambert  1989;  Butler  and  Louis  1992; 
Dietz and Bohning 1995; Aitkin 1996, 1999). Follmann and Lambert (1989) and
Butler and Louis (1992) obtained standard errors by calculating the observed
information matrix, which we will discuss in detail below. Dietz and Bohning (1995)
suggested three alternative estimators for the standard errors which they called the
Profile Likelihood (PL) estimator, the Likelihood-ratio (LR) estimator, and the
Multiple-Imputation (MI) estimator.

Let $\beta$ denote a single parameter and $\hat\beta$ its maximum likelihood
estimator. Also let $l_{\beta=w}$ be the supremum of the log-likelihood function
given that $\beta = w$. The PL estimator is primarily an estimator of confidence
intervals. Consider the values $\beta^+$ of $\beta$ that fall inside the 95 percent
confidence interval defined by the condition
$2(l_{\beta=\hat\beta} - l_{\beta=\beta^+}) < \chi^2_{0.95}$. The bounds on the
interval can be calculated by computing the profile likelihood for an appropriate
grid of $\beta$ values. Dietz and Bohning (1995) argued that if the PL confidence
interval is compact and approximately symmetric with respect to $\hat\beta$, the
standard error of $\hat\beta$ can be approximated by

\[ SE_{PL}(\hat\beta) = \frac{\beta_U - \beta_L}{2 \times 1.96}, \qquad (4.8) \]

where $\beta_U$ and $\beta_L$ are the upper and lower bound of the PL interval, respectively.
Dietz  and  Bohning  (1995)  noted  that  the  compactness  and  symmetry  requirements 
are  commonly  met  if  the  sample  size  is  sufficiently  large. 
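
As a hedged illustration of (4.8), the sketch below computes a PL standard error by grid search. It does not fit an NPML model: the profile log-likelihood is that of a normal mean with the variance profiled out, chosen only because it has a closed form, and the data, the function names, and the grid settings are hypothetical. The cutoff is the 0.95 quantile of the chi-square distribution with one degree of freedom, which matches the divisor $2 \times 1.96$.

```python
import math

CHI2_1_095 = 3.841458820694124  # 0.95 quantile of chi-square with 1 df

def pl_standard_error(profile_loglik, beta_hat, half_range, n_grid=6001):
    """Grid-search the profile likelihood interval and apply (4.8)."""
    l_max = profile_loglik(beta_hat)
    step = 2.0 * half_range / (n_grid - 1)
    grid = [beta_hat - half_range + i * step for i in range(n_grid)]
    inside = [b for b in grid
              if 2.0 * (l_max - profile_loglik(b)) <= CHI2_1_095]
    lower, upper = min(inside), max(inside)
    return lower, upper, (upper - lower) / (2.0 * 1.96)

# Hypothetical data; the parameter of interest is the mean.
y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4, 5.0, 5.9]
n = len(y)
y_bar = sum(y) / n

def normal_profile_loglik(mu):
    """Normal log-likelihood with the variance profiled out (constants dropped)."""
    return -0.5 * n * math.log(sum((v - mu) ** 2 for v in y) / n)

lower, upper, se_pl = pl_standard_error(normal_profile_loglik, y_bar, half_range=3.0)
```

In the NPML setting each grid point would require refitting the mixture with $\beta$ held fixed, which is where the computational cost of this estimator arises.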

The  second  estimator,  the  LR  estimator,  is  based  on  the  property  that  in  large 
samples  from  models  for  which  the  log-likelihood  is  quadratic  in  the  parameters,  the 
likelihood-ratio and Wald tests for the significance of an individual parameter are
equivalent (Dietz and Bohning 1995). For testing a single parameter $\beta = 0$, the
likelihood-ratio and Wald statistics are given by

\[ \lambda_{LR} = -2\,(l_{\beta=0} - l_{\beta=\hat\beta})
   \quad \text{and} \quad
   \lambda_{W} = \frac{\hat\beta^2}{\mathrm{Var}(\hat\beta)}, \qquad (4.9) \]

respectively. Since the Wald test and likelihood-ratio test are equivalent
asymptotically, one can equate the two equations in (4.9) to yield the standard
error estimate

\[ SE_{LR}(\hat\beta) = \left[ \frac{\hat\beta^2}{2\,(l_{\beta=\hat\beta} - l_{\beta=0})} \right]^{1/2}. \qquad (4.10) \]


This approach is also advocated by Aitkin (1996, 1999). Even if the log likelihood is
skewed where equivalence of the Wald and likelihood-ratio tests may not hold, Aitkin
(1996, 1999) argued that estimate (4.10) is still more appropriate than an estimate
based on the inverse information matrix. He reasoned that use of (4.10) in the Wald
statistic, $\lambda_W$, given in (4.9) leads to the likelihood-ratio statistic,
$\lambda_{LR}$, which reflects the skewness of the log-likelihood. Using the inverse
information matrix estimate of the standard error in $\lambda_W$ would be misleading
as it would not account for the skewed
log  likelihood. 
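
A small numerical sketch of (4.10) follows. It assumes a one-parameter binomial logit model, which is not a model from this chapter and is used only because both log-likelihood values are available in closed form; the counts and function names are hypothetical. The information-based standard error is computed alongside for comparison.

```python
import math

def binomial_loglik(beta, successes, trials):
    """Binomial log-likelihood in the logit parameter (constant dropped)."""
    return successes * beta - trials * math.log(1.0 + math.exp(beta))

def lr_standard_error(beta_hat, loglik_full, loglik_null):
    """Standard error (4.10); undefined when the two log-likelihoods coincide."""
    lr_stat = 2.0 * (loglik_full - loglik_null)
    return abs(beta_hat) / math.sqrt(lr_stat)

successes, trials = 30, 50
beta_hat = math.log(successes / (trials - successes))   # MLE of the logit
l_full = binomial_loglik(beta_hat, successes, trials)
l_null = binomial_loglik(0.0, successes, trials)         # fit with beta = 0

se_lr = lr_standard_error(beta_hat, l_full, l_null)

# Inverse-information standard error for comparison.
pi_hat = successes / trials
se_info = 1.0 / math.sqrt(trials * pi_hat * (1.0 - pi_hat))
```

By construction, `(beta_hat / se_lr) ** 2` reproduces the likelihood-ratio statistic exactly, which is the property Aitkin's argument relies on.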

The final estimator suggested by Dietz and Bohning (1995) is the
Multiple-Imputation (MI) estimator. In this approach, one augments the original data
with simulated membership data denoting which of the $K$ classes the particular
observation came from. The data are simulated using the parameter estimates and
estimated weights $\alpha_{ik}$ from the final iteration of the NPML algorithm. For
each of $m$ such simulated samples, the maximum likelihood estimate and standard
error of $\beta$ are estimated. Denoting the standard error estimate for the $q$th
sample as $\sigma_q$, the MI estimator for the variance of $\hat\beta$ is given by

\[ \mathrm{Var}_{MI}(\hat\beta) = \frac{1}{m} \sum_{q=1}^{m} \sigma_q^2. \qquad (4.11) \]


Dietz  and  Bohning  (1995)  reported  a small  simulation  study  comparing  the  three 
approaches  for  calculating  standard  errors.  They  concluded  that  all  three  provided 
adequate  estimates,  though  noting  that  the  LR  estimator  tended  to  be  conservative. 

A major deterrent for the use of the PL, LR, and MI estimators is the additional
computation required to obtain the estimates. To obtain the profile confidence
interval for the PL approach, one must fit repeated models over a grid of possible
$\beta$'s to determine the upper and lower confidence bounds. Such calculations would require
an  extremely  fast  algorithm  such  as  the  directional  derivative  approach  of  Lesperance 
and  Kalbfleisch  (1992).  The  LR  estimator  also  requires  additional  model  fitting.  For 
each  parameter,  one  must  fit  the  reduced  model  excluding  that  parameter  to  calculate 
the  standard  error.  This  approach  is  implemented  by  the  statistical  package  GLIM4, 
which  Aitkin  (1996,  1999)  has  used  for  fitting  his  NPML  models  and  may  be  a reason 
why  he  advocated  the  LR  estimator  (4.10)  approach.  The  MI  estimator  involves  an 
additional  m fits  after  obtaining  the  maximum  likelihood  values  for  the  parameters. 
Dietz  and  Bohning  (1995)  used  m = 1000  for  their  simulation  study  though  they  do 
not  comment  on  the  amount  of  time  needed  to  fit  the  m samples.  Even  with  the 
current  computing  power,  we  would  guess  this  would  require  an  exorbitant  amount 
of  time. 

For  the  NPML  algorithm  we  have  proposed,  we  obtain  estimates  of  the  standard 
errors  by  calculating  the  observed  information  matrix.  As  the  NPML  algorithm  is 
based  on  the  EM  algorithm,  one  method  for  finding  the  observed  information  matrix  is 
Louis’  method  (Louis  1982),  as  outlined  in  Chapter  3.  Butler  and  Louis  (1992)  utilized 
this approach when they performed their simulation studies. Alternatively, one can
calculate the observed information matrix directly, by evaluating at the maximum
likelihood estimates the negative of the second derivative of the log-likelihood function.
We  now  provide  the  necessary  derivatives  for  such  calculations. 


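For notational convenience, we denote the log-likelihood function as

\[ l = \sum_{i=1}^{n} l_i = \sum_{i=1}^{n} \log \sum_{k=1}^{K} p_k f_{ik}, \qquad (4.12) \]

where $f_{ik} = \prod_{j=1}^{T_i} f(y_{ijk} \mid \beta^*)$. Then, letting
$\Psi = (\beta^{*\prime}, p')'$ be the vector of parameters, the observed
information matrix is

\[ F_o(\Psi) = -\frac{\partial^2 l}{\partial \Psi \, \partial \Psi'}
   = \begin{pmatrix} F_o^{\beta^* \beta^*} & F_o^{\beta^* p} \\
                     F_o^{p \beta^*} & F_o^{p p} \end{pmatrix}. \qquad (4.13) \]

For the log-likelihood function (4.12),

\[ \frac{\partial l}{\partial p_k}
   = \sum_{i=1}^{n} \frac{f_{ik} - f_{iK}}{\sum_{l=1}^{K} p_l f_{il}} \qquad (4.14) \]

and

\[ \frac{\partial l}{\partial \beta^*}
   = \sum_{i=1}^{n} \sum_{k=1}^{K} \alpha_{ik} \, s_{ik}(\beta^*), \qquad (4.15) \]

where $\alpha_{ik} = p_k f_{ik} / \sum_{l=1}^{K} p_l f_{il}$, and $s_{ik}(\beta^*)$
denotes the contribution to the score function for the $i$th cluster in the $k$th
component. Recall that the form of the score function for a particular link function
is given in Chapter 2. Given (4.14) and (4.15), and denoting the contribution to the
observed information matrix for the $i$th cluster in the $k$th component by
$F_{o,ik}$ as defined in Chapter 2, the elements of (4.13) are defined as follows.
The $K - 1$ by 1 vector, $F_o^{\beta^* p}$, has elements

\[ F_o^{\beta^* p_k} = -\frac{\partial^2 l}{\partial \beta^* \, \partial p_k}
   = -\sum_{i=1}^{n} \left\{ \frac{\alpha_{ik}}{p_k} s_{ik} - \frac{\alpha_{iK}}{p_K} s_{iK}
     - \frac{\partial l_i}{\partial p_k} \frac{\partial l_i}{\partial \beta^*} \right\}. \]

$F_o^{pp}$ is a $K - 1$ by $K - 1$ matrix with $(r, s)$th element

\[ F_o^{p_r p_s} = -\frac{\partial^2 l}{\partial p_r \, \partial p_s}
   = \sum_{i=1}^{n} \frac{\partial l_i}{\partial p_r} \frac{\partial l_i}{\partial p_s}, \]

and

\[ F_o^{\beta^* \beta^*} = -\frac{\partial^2 l}{\partial \beta^* \, \partial \beta^{*\prime}}
   = \sum_{i=1}^{n} \sum_{k=1}^{K} \left\{ \alpha_{ik} F_{o,ik} - \alpha_{ik} s_{ik} s_{ik}'
     + \alpha_{ik} s_{ik} \left( \sum_{l=1}^{K} \alpha_{il} s_{il} \right)' \right\}. \]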

Upon  convergence  of  the  NPML  algorithm,  the  observed  information  matrix  is 
evaluated  at  the  maximum  likelihood  estimates  of  the  fixed  parameters  and  mixing 
distribution  and  then  inverted  to  obtain  an  estimated  variance-covariance  matrix  for 
the  parameters.  We  note  that  missing  from  the  observed  information  matrix  is  the 
support size parameter $K$. Follmann and Lambert (1989) used a small simulation
study to show that calculation of the standard errors by assuming that $\hat K$ was
the true support size is appropriate even if $\hat K \ne K$. They argued that the
variability in an estimated parameter $\hat\beta$ depends on the variability in the
mixing distribution $G$, which can be captured by $\hat G$ even if $\hat K \ne K$.
In their simulation they compared Monte Carlo estimates of standard errors to those
obtained from the observed information matrix where the true mixture was normal and
a two point mixture. Even in the normal case where the estimated $\hat K$ is far
from the true continuous distribution, the two estimates of the standard errors were
close. In our simulation study in Section 4.5.1,
we  examine  the  performance  of  the  observed  information  matrix  standard  errors  as 
well,  under  a variety  of  mixture  distributions. 
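
The derivatives with respect to the masses can be checked numerically. The sketch below is an illustration only: it treats the cluster likelihoods $f_{ik}$ as fixed numbers (so the blocks involving $\beta^*$ are ignored), uses hypothetical data and function names, and evaluates the score in its posterior-weight form $\alpha_{ik}/p_k - \alpha_{iK}/p_K$, which equals the summand of (4.14). The block of the observed information for the masses is then the sum of outer products of the cluster-level scores.

```python
import math

def mix_density(p, row):
    """Marginal likelihood of one cluster, sum_k p_k f_ik."""
    return sum(pk * fk for pk, fk in zip(p, row))

def loglik(p_free, f):
    """Log-likelihood (4.12) as a function of the free masses p_1..p_{K-1}."""
    p = list(p_free) + [1.0 - sum(p_free)]
    return sum(math.log(mix_density(p, row)) for row in f)

def posterior_weights(p, row):
    """Weights alpha_ik = p_k f_ik / sum_l p_l f_il for one cluster."""
    d = mix_density(p, row)
    return [pk * fk / d for pk, fk in zip(p, row)]

def cluster_scores(p, f):
    """d l_i / d p_k for k = 1..K-1, written with the posterior weights."""
    K = len(p)
    scores = []
    for row in f:
        a = posterior_weights(p, row)
        scores.append([a[k] / p[k] - a[K - 1] / p[K - 1] for k in range(K - 1)])
    return scores

def score_and_information(p, f):
    """Score vector and the masses block of the observed information."""
    d = cluster_scores(p, f)
    m = len(p) - 1
    score = [sum(di[k] for di in d) for k in range(m)]
    info = [[sum(di[r] * di[s] for di in d) for s in range(m)] for r in range(m)]
    return score, info

# Hypothetical cluster likelihoods for K = 3 mass points.
f = [[0.9, 0.3, 0.1], [0.2, 0.7, 0.3], [0.1, 0.2, 0.8], [0.5, 0.5, 0.2]]
p = [0.5, 0.3, 0.2]
score, info = score_and_information(p, f)
```

Comparing `score` with central finite differences of `loglik` is a quick way to catch sign or indexing mistakes before the full matrix (4.13) is assembled and inverted.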

Hypothesis  testing  and  model  comparisons 

If  the  support  size  K were  known,  hypothesis  testing  for  the  fixed  parameters  and 
model  comparisons  could  be  carried  out  using  the  standard  likelihood-ratio  test  as 
defined in (4.9). For the application of mixtures at hand, the support size is an
unknown parameter and must be estimated from the data. The asymptotic distribution
of the likelihood-ratio test in this context is still unknown. Consider the problem of
testing a single fixed parameter equal to zero. In a fixed effects model, the
distribution of the likelihood-ratio statistic would be $\chi^2$ with one degree of
freedom, since
the  difference  in  parameters  between  the  null  and  the  alternative  hypotheses  is  one. 
In  the  models  considered  here,  the  difference  in  parameters  between  the  null  and 
the  alternative  need  not  be  one.  If,  for  instance,  an  additional  mass  parameter  was 
needed  under  the  alternative  hypothesis,  the  difference  in  parameters  would  be  three: 
one  for  the  fixed  parameter,  one  for  the  mass  point  m,  and  one  for  the  mass  p.  In 
this  instance,  one  could  view  the  additional  mass  p in  the  alternative  as  being  zero  in 
the  null  hypothesis,  however  the  asymptotic  theory  for  the  likelihood-ratio  test  fails 
again  as  p = 0 is  on  the  boundary  of  the  parameter  space. 

Despite  the  lack  of  theoretical  justification  for  its  use,  the  likelihood-ratio  test 
has  been  commonly  used  for  testing  fixed  parameters  and  making  model  comparisons 
(Davies  1987;  Wood  and  Hinde  1987;  Aitkin  1996,  1999).  Surprisingly,  justification 
for  its  use  has  generally  come  from  a single  simulation  study  by  Davies  (1987).  Davies 
(1987)  suspected  that  the  flexibility  of  the  NPML  approach  in  accounting  for  variation 
between  sampled  individuals  might  make  it  difficult  to  detect  systematic  variation  due 
to  explanatory  variables.  He  conjectured  that  the  power  of  the  likelihood-ratio  test 
would be lower for the NPML approach as compared with equivalent tests in
parametric approaches. To test this claim, Davies (1987) conducted a simulation study
using a negative binomial model. The negative binomial distribution is commonly
used for modeling Poisson response data that exhibit overdispersion. Given that a
count $y_i \mid \lambda_i$ is distributed Poisson with mean $\lambda_i$, the negative
binomial model is obtained by assuming the rate parameter $\lambda_i$ follows a Gamma
distribution with parameters $a$ and $b$. The resulting marginal distribution of
$y_i$ is negative binomial with mean $ab$ and variance $a(b + 1)b$. Davies (1987)
fixed $a$ and $b$, and assumed that the count $y_i$ was dependent on a single
covariate. He then simulated from the true negative binomial model 500 datasets with
the coefficient of the covariate $\beta = 0$ and 100 datasets for each of
$\beta = 0.1, 0.2, \ldots, 0.7$. Davies (1987) tabulated the rejection
rates for testing the null hypothesis of $\beta = 0$ from both the parametric and
nonparametric fits. He concluded that there was no evidence of the likelihood-ratio test
being  distributed  differently  for  mass  point  models.  In  addition,  any  loss  of  power 
through  the  use  of  a nonparametric  approach  was  too  small  to  be  detected  by  the 
simulation  study.  Davies  (1987)  noted  that  the  support  sizes  ranged  from  three  to 
five  throughout  the  simulation  study.  However,  it  was  not  clear  whether  the  number 
of  mass  points  changed  under  the  null  and  alternative  hypothesis  for  a given  dataset. 
In  Section  4.5.2  we  conduct  a similar  simulation  study  to  investigate  the  rejection 
rate  of  the  likelihood-ratio  test  for  the  NPML  approach  compared  to  the  parametric 
approach  of  the  previous  chapter. 
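
The Poisson-Gamma construction of the negative binomial described above can be simulated directly. The sketch below is not a reconstruction of the Davies (1987) study: it has no covariate, the values of $a$ and $b$ are hypothetical, and it assumes $a$ is the Gamma shape and $b$ the Gamma scale, the parameterization under which the marginal mean is $ab$ and the marginal variance is $ab(b + 1)$. Poisson draws use Knuth's multiplication method because the Python standard library has no Poisson sampler.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw from Poisson(lam) by multiplying uniforms until below exp(-lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_negbin(a, b, n, seed=1):
    """Counts y_i | lambda_i ~ Poisson(lambda_i), lambda_i ~ Gamma(shape a, scale b)."""
    rng = random.Random(seed)
    return [poisson_draw(rng, rng.gammavariate(a, b)) for _ in range(n)]

a, b = 2.0, 1.5                 # hypothetical Gamma shape and scale
draws = simulate_negbin(a, b, n=20000)

sample_mean = sum(draws) / len(draws)
sample_var = sum((y - sample_mean) ** 2 for y in draws) / (len(draws) - 1)
# Theoretical values under this parameterization: mean a*b = 3.0,
# variance a*b*(b + 1) = 7.5, so the variance exceeds the mean (overdispersion).
```

A power study of the kind described would add a covariate to the Poisson mean and refit both the parametric and the NPML model to each simulated dataset.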

An  alternative  to  the  likelihood-ratio  test  for  testing  a fixed  parameter  is  the  Wald 
test,  given  in  (4.9).  Under  the  usual  regularity  conditions  (Rao  1973,  p.  364),  the 
estimates  of  the  parameters,  being  maximum  likelihood  estimates,  are  asymptotically 
normal.  Given  an  estimate  of  the  asymptotic  covariance  matrix,  as  defined  in  the 
previous  section,  the  ratio  of  the  square  of  the  parameter  estimate  and  the  variance  of 
the estimator has, asymptotically, a $\chi^2$ distribution with one degree of freedom. Use
of  the  Wald  statistic  in  the  context  of  the  models  considered  here  has  been  limited,  due 
to  the  lack  of  exact  asymptotic  standard  errors.  Though  this  is  true,  one  could  argue 
that  use  of  the  Wald  statistic  is  no  more  incorrect  than  the  use  of  the  likelihood-ratio 
statistic.  The  Wald  statistic  is  also  unaffected  by  the  potential  difference  in  support 
sizes  between  the  null  and  alternative  hypotheses  as  in  the  likelihood-ratio  test.  Thus, 
we  also  examine  the  performance  of  the  Wald  test  in  the  simulation  study  in  Section 
4.5.2. 
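
For reference, the Wald test of a single parameter reduces to a few lines. This sketch assumes an estimate and its variance are already available, for example from the inverted observed information matrix; the numerical values and function names are hypothetical. The chi-square tail probability with one degree of freedom is obtained from the complementary error function, since $P(\chi^2_1 > x) = P(|Z| > \sqrt{x})$ for a standard normal $Z$.

```python
import math

def chi2_1_sf(x):
    """Upper tail probability P(X > x) for X ~ chi-square with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def wald_test(beta_hat, variance):
    """Wald statistic beta_hat^2 / Var(beta_hat) and its asymptotic p-value."""
    stat = beta_hat ** 2 / variance
    return stat, chi2_1_sf(stat)

# Hypothetical estimate and variance.
stat, p_value = wald_test(0.8, 0.09)
```

The p-value is only as trustworthy as the variance estimate, which is the caveat raised above for the NPML setting.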

As  in  fixed  parameter  hypothesis  testing,  the  ability  to  make  formal  quantitative 
inferences  concerning  the  mixing  distribution  G is  well  behind  our  ability  to  estimate 
it and use it informally. Though there has been considerable research in the area,
many of the basic properties of the maximum likelihood estimate of $G$ are still
unknown. Certain results concerning $G$ have been established, but only under special
circumstances. For example, the asymptotic distribution of $\hat G$ is unknown; however,
Lindsay (1989) showed that the moments of $G$ can be estimated with the usual root-n
asymptotics. Of main interest for the application of mixtures considered here is to
test the heterogeneity model versus one not including a random effect. In terms of
the parameters of the mixing distribution, one would test that the mixing
proportions are zero. As before, this entails testing that the masses are on the
boundary of the parameter space, which precludes the use of the likelihood-ratio
test. By simulation, Bohning et al. (1994) showed for mixtures of densities from the
one-parameter exponential family that the distribution of the likelihood-ratio
statistic for testing homogeneity versus a two component mixture was similar to a
mixture of $\chi^2_1$ and $\chi^2_0$. Even so, a number of authors use the change in
deviance between the homogeneity and heterogeneity models as a guideline for making
such decisions (Aitkin 1996, 1999).

4.3  Identifiability 

When considering any mixtures of probability distributions, the problem of
identifiability is of great importance. A given mixture is identifiable if it is uniquely
characterized in the sense that two distinct sets of parameters defining the mixture
cannot yield the same distribution. In this section we consider the identifiability of
the  nonparametric  mixture  models  proposed  in  the  previous  sections.  Specifically, 
we  give  sufficient  conditions  for  the  identifiability  of  the  mixture  multinomial  random 
effects  model,  when  the  mixing  is  over  a single  multinomial  distribution.  Such  models 
can  be  considered  as  overdispersed  multinomial  models.  We  have  already  considered 
such  a model  in  Section  3.6.2  where  we  fit  a shifted  threshold  model  to  the  toxicity 
dataset.  Recall  that  a multinomial  response  vector  was  recorded  for  each  litter  of 
mice. By fitting a shifted threshold model, we accounted for possible overdispersion
among the multinomial response vectors of the litters. The results shown in this
section are based on extensions of results by Teicher (1963, 1967). An excellent review
of  identifiability  and  its  many  applications  can  be  found  in  Prakasa  Rao  (1992). 

Our goal in this section is to provide conditions under which overdispersed
multinomial models are identifiable. An explicit definition of identifiability in the
context of these models and conditions for their identifiability will be given in
Section 4.3.2. To arrive at such conditions we first consider the identifiability of
mixtures of multinomial distributions and the mixtures of products of multinomial
distributions in Section 4.3.1. These results are extensions of results by Teicher
(1963, 1967) where mixtures of binomial distributions were considered. As in the case
of binomial mixtures, the class of all finite mixtures of multinomial distributions is
not identifiable. We will show, however, that certain subsets of this family are
identifiable when certain conditions are met. To determine which subsets are
identifiable, we first prove Theorem 4.1, a generalization of Proposition 3 in Teicher
(1963), which characterizes identifiable mixtures of general multinomial
distributions. This leads directly to
necessary  and  sufficient  conditions  for  the  identifiability  of  mixtures  of  multinomial 
distributions  with  fixed  sample  sizes.  We  then  show  that  under  certain  conditions, 
products  of  multinomial  distributions  are  also  identifiable. 

4.3.1  Mixtures  of  Multinomial  and  Product  Multinomial  Distributions 

In  this  section  we  consider  the  identifiability  of  mixtures  of  multinomial  distribu- 
tions and  products  of  multinomial  distributions.  We  begin  with  an  example  which 
will  illustrate  the  concept  of  identifiability. 

Example  4.1 

In this example we show that mixtures of two binomial distributions with sample
size two are not identifiable. Let $B(2, \pi)$ denote the binomial distribution with two
trials and success probability $\pi$, $0 < \pi < 1$. Let $G_{\pi_1,\pi_2,\alpha}$ be a mixing distribution
with $P(\pi = \pi_1) = \alpha = 1 - P(\pi = \pi_2)$, where $\pi_1 \ne \pi_2$ and $0 < \alpha < 1$. Let $X$ denote
a random variable with the distribution that is a mixture of $B(2, \pi)$ with respect to
the mixing distribution $G_{\pi_1,\pi_2,\alpha}$. Then

$$P_{\pi_1,\pi_2,\alpha}(X = 0) = \alpha(1 - \pi_1)^2 + (1 - \alpha)(1 - \pi_2)^2, \tag{4.16}$$

$$P_{\pi_1,\pi_2,\alpha}(X = 1) = 2\alpha\pi_1(1 - \pi_1) + 2(1 - \alpha)\pi_2(1 - \pi_2), \tag{4.17}$$

and

$$P_{\pi_1,\pi_2,\alpha}(X = 2) = \alpha\pi_1^2 + (1 - \alpha)\pi_2^2.$$

Since $\sum_{i=0}^{2} P_{\pi_1,\pi_2,\alpha}(X = i) = 1$, only equations (4.16) and (4.17) are needed to
determine $P_{\pi_1,\pi_2,\alpha}(X = i)$ for $i = 0, 1, 2$.

The mixture of two binomials $B(2, \pi_1)$ and $B(2, \pi_2)$ with respect to $G_{\pi_1,\pi_2,\alpha}$ is said
to be identifiable if

$$P_{\pi_1,\pi_2,\alpha}(X = 0) = P_{\pi_1^*,\pi_2^*,\alpha^*}(X = 0) \quad \text{and} \quad P_{\pi_1,\pi_2,\alpha}(X = 1) = P_{\pi_1^*,\pi_2^*,\alpha^*}(X = 1) \tag{4.18}$$

implies that

$$(\pi_1, \pi_2, \alpha) = (\pi_1^*, \pi_2^*, \alpha^*). \tag{4.19}$$

Since (4.16) and (4.17) are a set of two equations with three unknowns $(\pi_1, \pi_2, \alpha)$,
there are infinitely many solutions for a given pair of values for $P_{\pi_1,\pi_2,\alpha}(X = 0)$ and
$P_{\pi_1,\pi_2,\alpha}(X = 1)$. For example, two such solutions that yield the same set of values
are $(\pi_1, \pi_2, \alpha) = (0.5, 0.6, 0.5)$ and $(\pi_1, \pi_2, \alpha) = (0.3, 0.56, 0.038462)$. Therefore, since
(4.18) does not imply (4.19), the family of mixtures of two binomials $B(2, \pi_1)$ and
$B(2, \pi_2)$ is not identifiable.
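The two parameter sets quoted in Example 4.1 can be checked numerically. The following is a minimal Python sketch and not part of the original analysis; the function name `binomial_mixture_pmf` is illustrative, and the mixing weight 0.038462 is assumed to be 1/26 rounded to six decimals.

```python
from math import comb, isclose


def binomial_mixture_pmf(pi1, pi2, alpha, n=2):
    """Return [P(X=0), ..., P(X=n)] for alpha*B(n, pi1) + (1 - alpha)*B(n, pi2)."""
    def binom_pmf(x, p):
        return comb(n, x) * p**x * (1 - p) ** (n - x)

    return [
        alpha * binom_pmf(x, pi1) + (1 - alpha) * binom_pmf(x, pi2)
        for x in range(n + 1)
    ]


# First parameter set from the example.
first = binomial_mixture_pmf(0.5, 0.6, 0.5)
# Second parameter set; 1/26 is assumed to be the exact value behind 0.038462.
second = binomial_mixture_pmf(0.3, 0.56, 1 / 26)

for x, (a, b) in enumerate(zip(first, second)):
    print(f"P(X={x}): {a:.6f} vs {b:.6f}")

# True when the two distinct parameter sets give the same distribution.
same_distribution = all(isclose(a, b, abs_tol=1e-12) for a, b in zip(first, second))
print("same distribution:", same_distribution)
```

Under these assumptions the two mixtures agree at every point of the sample space, which is exactly the failure of (4.18) to imply (4.19).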

Example 4.1 is a special case of a result proven by Teicher (1963). Teicher (1963)
showed that for the family of binomial distributions with fixed sample size $n$ and
parameter $\pi$, a necessary and sufficient condition for the class of all finite mixtures
of at most $k$ elements of the family to be identifiable is that $n \ge 2k - 1$. In Example
4.1, $n$ and $k$ were both two and so $n < 2k - 1$. The proof of this result follows a
similar argument to that used in Example 4.1. In general, condition (4.18) states
that the two mixtures of binomials are the same for all possible values, $x$, of the
binomial random variable $X$. Teicher (1963) expressed this condition in terms of the
moments of the binomial distribution and as a function of its parameters, and showed
that this system of equations has a unique solution if $n \ge 2k - 1$. We will use this
same approach to prove Theorem 4.1, which extends the result of Teicher (1963) to
multinomial distributions. For the multinomial case, however, the system of equations
based on the moments of the multinomial distribution no longer has $n + 1$ equations.
The number of equations will depend on the multinomial sample size $n$, but not in
the simple form $n + 1$ as in the binomial case. The reason is that condition (4.18)
must now hold for all possible vectors $\mathbf{y} = (y_1, \ldots, y_q)'$ of the multinomial random
variable $\mathbf{Y}$ with $R = q + 1$ probabilities and sample size $n$, where the vector $\mathbf{y}$ is an
element of $\mathcal{K}_n = \{\mathbf{y} : \mathbf{y}' = (y_1, \ldots, y_q),\ \sum_{i=1}^{q} y_i \le n,\ y_i \in \{0, 1, 2, \ldots\},\ 1 \le i \le q\}$.
The number of equations will then be the total number of possible vectors $\mathbf{y}$ for a
given multinomial sample size $n$. Table 4.1 lists the number of equations for a small
set of multinomial sample sizes $n$ and number of probabilities $R$. For $R = 2$, the
binomial case, the number of equations is again seen to be $n + 1$. In general, the
number of possible equations, $C_{nq}$, for a multinomial sample size of $n$ with $R = q + 1$
probabilities can be found from

$$C_{nq} = \sum_{i=0}^{n} \sum_{l_1=0}^{i} \sum_{l_2=0}^{l_1} \cdots \sum_{l_{q-2}=0}^{l_{q-3}} (l_{q-2} + 1), \tag{4.20}$$

where $\sum_{l_1=0}^{i} \cdots \sum_{l_{q-2}=0}^{l_{q-3}} (l_{q-2} + 1)$ is defined to be 1 for $q = 1$ and $(i + 1)$ for $q = 2$.

Table 4.1: Total number of multinomial vectors y for multinomial sample size n and
number of probabilities R

               R Probabilities
   n        2       3       4       5
   1        2       3       4       5
   2        3       6      10      15
   3        4      10      20      35
   4        5      15      35      70
   5        6      21      56     126
  10       11      66     286    1001
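As a check on (4.20) and Table 4.1, the count can be computed in two ways. The sketch below is illustrative only: the names `count_vectors` and `c_nq` are hypothetical, and the comparison with the binomial coefficient $\binom{n+q}{q}$ relies on the standard stars-and-bars count for such vectors rather than on a formula stated in the text.

```python
from itertools import product
from math import comb


def count_vectors(n, q):
    """Count vectors (y_1, ..., y_q) of nonnegative integers with sum <= n by enumeration."""
    return sum(1 for y in product(range(n + 1), repeat=q) if sum(y) <= n)


def c_nq(n, q):
    """Evaluate the nested sum in (4.20).

    The innermost summand (l + 1) is written as one more sum of ones, so the
    formula becomes q nested sums with summand 1.
    """
    def inner(upper, depth):
        # depth is the number of summation signs still to be expanded
        if depth == 0:
            return 1
        return sum(inner(i, depth - 1) for i in range(upper + 1))

    return inner(n, q)


# Rebuild the body of Table 4.1: rows are n, columns are R = q + 1 = 2, ..., 5.
table = {n: [c_nq(n, R - 1) for R in (2, 3, 4, 5)] for n in (1, 2, 3, 4, 5, 10)}
for n, row in table.items():
    print(n, row)
```

The enumeration, the nested sum, and the binomial coefficient agree for every entry of the table, which is a useful sanity check when the nested-sum notation is hard to read.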

We begin in Theorem 4.1 by characterizing the subfamilies of the class of all
finite mixtures of multinomial distributions that are identifiable. Letting
$\mathcal{P} = \{\boldsymbol{\pi} : \sum_{i=1}^{q} \pi_i < 1,\ 0 < \pi_i < 1,\ 1 \le i \le q\}$ and $\mathbb{N}$ be the positive integers, we denote
the multinomial distribution with parameters $n$ and $\boldsymbol{\pi}$ as $M(\mathbf{y}; n, \boldsymbol{\pi})$, where $\boldsymbol{\pi} \in \mathcal{P}$,
$\mathbf{y} \in \mathcal{K}_n$ and $n \in \mathbb{N}$. In Theorem 4.1 we consider mixtures of general multinomial
distributions in which both $n$ and $\boldsymbol{\pi}$ are considered as parameters. The conclusions of
Theorem 4.1 lead directly to necessary and sufficient conditions for the identifiability
of mixtures of multinomial distributions with fixed $n$.

Theorem  4.1 


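Let $\mathcal{F}_1 = \{M(\mathbf{y}; n_i', \boldsymbol{\pi}_i'),\ \boldsymbol{\pi}_i' \in \mathcal{P},\ n_i' \in \mathbb{N},\ 1 \le i \le k'\}$ and
$\mathcal{F}_2 = \{M(\mathbf{y}; n_i'', \boldsymbol{\pi}_i''),\ \boldsymbol{\pi}_i'' \in \mathcal{P},\ n_i'' \in \mathbb{N},\ 1 \le i \le k''\}$ denote two finite families of
multinomial distributions and let $k$ be the number of elements in $\mathcal{F}_1 \cup \mathcal{F}_2$. Denote
the $h$ unique multinomial sample sizes in $\mathcal{F}_1 \cup \mathcal{F}_2$ as $h_1 > h_2 > \cdots > h_h$ and let $r_i$
be the number of occurrences of $h_i$, $1 \le i \le h$, in $\mathcal{F}_1 \cup \mathcal{F}_2$.

(i) A necessary condition for

$$\sum_{i=1}^{k'} c_i' M(\mathbf{y}; n_i', \boldsymbol{\pi}_i') = \sum_{i=1}^{k''} c_i'' M(\mathbf{y}; n_i'', \boldsymbol{\pi}_i''), \quad \sum_{i=1}^{k'} c_i' = \sum_{i=1}^{k''} c_i'' = 1, \quad 0 < c_i', c_i'', \tag{4.21}$$

to imply

$$k' = k'', \quad (n_i', \boldsymbol{\pi}_i') = (n_{j_i}'', \boldsymbol{\pi}_{j_i}'') \tag{4.22}$$

for some permutation $(j_1, \ldots, j_k)$ of $(1, \ldots, k)$ is that

$$C_{h_h q} \ge r_h. \tag{4.23}$$

(ii) A sufficient condition that (4.21) imply (4.22) is that (4.23) and

$$C_{h_i q} - C_{h_{i+1} q} \ge r_i, \quad 1 \le i \le h - 1, \tag{4.24}$$

hold.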

Remark. When (4.21) implies (4.22), the class of all finite mixtures of $k^* = k' = k''$
elements of $\mathcal{F} = \{M(\mathbf{y}; n, \boldsymbol{\pi}),\ \boldsymbol{\pi} \in \mathcal{P},\ n \in \mathbb{N}\}$ is said to be identifiable.

Proof 

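First note that (4.21) is equivalent to

$$\sum_{i=1}^{k} d_i M(\mathbf{y}; n_i, \boldsymbol{\pi}_i) = 0, \quad \sum_{i=1}^{k} d_i = 0, \tag{4.25}$$

where $M(\mathbf{y}; n_i, \boldsymbol{\pi}_i)$, $1 \le i \le k$, are the elements of $\mathcal{F}_1 \cup \mathcal{F}_2$. Let $s_j = \sum_{i=1}^{j} r_i$
and $s_0 = 0$. Without loss of generality, we order the multinomial distributions in
(4.25) such that $d_1, \ldots, d_{s_1}$ correspond to the $r_1$ distributions with sample size $h_1$,
$d_{s_1+1}, \ldots, d_{s_2}$ correspond to the $r_2$ distributions with sample size $h_2$, and so on. The
moment generating function for the multinomial distribution is

$$E\left[e^{\mathbf{t}'\mathbf{Y}}\right] = \Big(1 + \sum_{j=1}^{q} \pi_j w_j\Big)^{n},$$

where $w_j = e^{t_j} - 1$. Since the moment generating function of a random variable
uniquely determines the distribution function of the random variable, (4.25) can be
expressed as

$$\sum_{i=1}^{k} d_i \Big(1 + \sum_{j=1}^{q} \pi_{ij} w_j\Big)^{n_i} = 0,$$

which, by the multinomial theorem, leads to

$$\sum_{i=1}^{k} d_i \sum_{\mathbf{y} \in \mathcal{K}_{n_i}} \frac{n_i!}{[n_i; \mathbf{y}]!} \prod_{j=1}^{q} (\pi_{ij} w_j)^{y_j} = 0, \tag{4.26}$$

where we define $[n_i; \mathbf{y}]! = (n_i - \sum_{j=1}^{q} y_j)! \times \prod_{j=1}^{q} y_j!$. Careful grouping of terms in
(4.26) leads to the following sets of equations

$$\frac{h_1!}{[h_1; \mathbf{k}]!} \sum_{i=1}^{s_1} d_i \pi_{i1}^{k_1} \cdots \pi_{iq}^{k_q} = 0, \quad \mathbf{k} \in A_1, \tag{4.27}$$

$$\frac{h_1!}{[h_1; \mathbf{k}]!} \sum_{i=1}^{s_1} d_i \pi_{i1}^{k_1} \cdots \pi_{iq}^{k_q} + \frac{h_2!}{[h_2; \mathbf{k}]!} \sum_{i=s_1+1}^{s_2} d_i \pi_{i1}^{k_1} \cdots \pi_{iq}^{k_q} = 0, \quad \mathbf{k} \in A_2, \tag{4.28}$$

$$\frac{h_1!}{[h_1; \mathbf{k}]!} \sum_{i=1}^{s_1} d_i \pi_{i1}^{k_1} \cdots \pi_{iq}^{k_q} + \cdots + \frac{h_h!}{[h_h; \mathbf{k}]!} \sum_{i=s_{h-1}+1}^{s_h} d_i \pi_{i1}^{k_1} \cdots \pi_{iq}^{k_q} = 0, \quad \mathbf{k} \in A_h, \tag{4.29}$$

where $A_t = \{\mathbf{y} : h_{t+1} + 1 \le \sum_{j=1}^{q} y_j \le h_t,\ y_j \in \{0, 1, 2, \ldots\},\ 1 \le j \le q\}$.

We prove (i) by contradiction. Assume that (4.21) implies (4.22) but (4.23) does
not hold, that is that $C_{h_h q} < r_h$. By choosing $d_i = 0$, $1 \le i \le s_{h-1}$, one can elicit
a counter-example in which (4.27)-(4.29) and (4.21) hold, but (4.22) does not. Such
a counter-example was given in Example 4.1 where $(k', k'') = (n_1', n_2') = (n_1'', n_2'') =
(2, 2)$ and $(d_1, d_2, d_3, d_4) = (.5, .5, -.038462, -.961538)$. Thus (i) holds.

To see that (ii) is true, note that when (4.23) and (4.24) hold, for each $t = 1, \ldots, h$
and $A_t$ defined as before

$$\frac{h_t!}{[h_t; \mathbf{k}]!} \sum_{i=s_{t-1}+1}^{s_t} d_i \pi_{i1}^{k_1} \cdots \pi_{iq}^{k_q} = 0, \quad \mathbf{k} \in A_t,$$

is a set of $E_t = C_{h_t q} - C_{h_{t+1} q}$ equations with $r_t$ unknowns, where $E_t \ge r_t$. Thus the
only solution for each set of equations is the zero solution. That is $d_i = 0$, $1 \le i \le k$,
which satisfies (4.21) and implies (4.22) since $c_i', c_i'' > 0$.

□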


An  immediate  consequence  of  Theorem  4.1,  paralleling  Proposition  4 of  Teicher 
(1963),  is  the  following: 

Corollary  4.1 

Let $\mathcal{F} = \{M(\mathbf{y}; n, \boldsymbol{\pi}),\ \boldsymbol{\pi} \in \mathcal{P}\}$ denote a family of multinomial distributions with
parameter $\boldsymbol{\pi}$ and fixed sample size $n$. A necessary and sufficient condition that the
class of all finite mixtures of at most $k$ elements of $\mathcal{F}$ be identifiable is that $C_{nq} \ge 2k$.

Proof

In Theorem 4.1 let $\mathcal{F}_1 = \{M(\mathbf{y}; n, \boldsymbol{\pi}_i'),\ \boldsymbol{\pi}_i' \in \mathcal{P},\ 1 \le i \le k'\}$ and
$\mathcal{F}_2 = \{M(\mathbf{y}; n, \boldsymbol{\pi}_i''),\ \boldsymbol{\pi}_i'' \in \mathcal{P},\ 1 \le i \le k''\}$. Since the only distinct multinomial sample
size in $\mathcal{F}_1 \cup \mathcal{F}_2$ is $n$, $h_1 = n$ and $r_1 = k' + k''$. From (i) in Theorem 4.1 a
necessary condition for (4.21) to imply (4.22) is that $C_{nq} \ge (k' + k'') = 2k$ (by
(4.22)). But this also obtains sufficiency since condition (4.24) holds trivially when
all multinomial sample sizes are the same.

□ 

Applying Corollary 4.1 and (4.20) for $q = 1$, one obtains the result of Teicher (1963)
that finite mixtures of binomial distributions are identifiable if and only if $n \ge 2k - 1$.
Thus mixtures of binomials with sample size one are not identifiable. In contrast,
mixtures of multinomials with sample size one and $R > 3$ are identifiable. Using
(4.20) one can see, for example, that for $q = 3, 4$ the number of identifiable mixture
components for mixtures of multinomials with sample size one is two.
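The bound in Corollary 4.1 is easy to tabulate. The sketch below is a hypothetical helper, not code from the original work; it assumes the closed form $C_{nq} = \binom{n+q}{q}$ (the usual count of nonnegative integer vectors of length $q$ with sum at most $n$) and returns the largest $k$ with $C_{nq} \ge 2k$.

```python
from math import comb


def max_identifiable_components(n, q):
    """Largest k satisfying C_nq >= 2k, with C_nq taken as comb(n + q, q)."""
    return comb(n + q, q) // 2


# Binomial case (q = 1): the bound should match n >= 2k - 1, i.e. k <= (n + 1) / 2.
binomial_bounds = {n: max_identifiable_components(n, 1) for n in range(1, 8)}
print("binomial:", binomial_bounds)

# Sample size one with R = q + 1 response probabilities.
size_one_bounds = {q: max_identifiable_components(1, q) for q in (1, 2, 3, 4)}
print("n = 1:", size_one_bounds)
```

A value of one means that only the degenerate one-component "mixture" is covered, which is the sense in which binomial mixtures with sample size one are not identifiable.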

Teicher (1967) considered the identifiability of general mixtures of products of
distributions. Let $\mathcal{F} = \{F(y; \alpha) : \alpha \in \mathbb{R}^m\}$ represent a family of distributions $F(y; \alpha)$
and let $\mathcal{F}_n^* = \{F^*(\mathbf{y}; \boldsymbol{\alpha}) : F^*(\mathbf{y}; \boldsymbol{\alpha}) = \prod_{i=1}^{n} F(y_i; \alpha_i),\ F(y_i; \alpha_i) \in \mathcal{F},\ 1 \le i \le n\}$ so
that if $Y_1, \ldots, Y_n$ are independent random variables each of whose distributions is in
$\mathcal{F}$, their joint distribution is an element of $\mathcal{F}_n^*$. Teicher (1967) showed the following:


Theorem 4.2 (Theorem 2, Teicher [1967])

If the class of all finite mixtures of $\mathcal{F}$ is identifiable, then for every $n > 1$ the class
of finite mixtures of $\mathcal{F}_n^*$ is likewise identifiable.

Using  Theorem  4.2  and  Corollary  4.1,  we  conclude  this  section  by  giving  conditions 
under  which  mixtures  of  products  of  multinomial  distributions  are  identifiable. 

Theorem  4.3 

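Let $\mathcal{F}_i = \{M(\mathbf{y}; n_i, \boldsymbol{\pi}_i),\ \boldsymbol{\pi}_i \in \mathcal{P}\}$, $1 \le i \le N$, be $N$ families of multinomial
distributions with parameter $\boldsymbol{\pi}_i$ and fixed sample size $n_i$ and let
$\mathcal{F}_N = \{F(\mathbf{y}; \mathbf{n}, \boldsymbol{\pi}) : F(\mathbf{y}; \mathbf{n}, \boldsymbol{\pi}) = \prod_{i=1}^{N} M(\mathbf{y}_i; n_i, \boldsymbol{\pi}_i),\ M(\mathbf{y}_i; n_i, \boldsymbol{\pi}_i) \in \mathcal{F}_i,\ 1 \le i \le N\}$.
Then the class of all finite mixtures of at most $k$ elements of $\mathcal{F}_N$ is identifiable if
$\min_{1 \le i \le N} \{C_{n_i q}\} \ge 2k$.

Proof

By Corollary 4.1, each class of all finite mixtures of at most $k$ elements of $\mathcal{F}_i$ is
identifiable if and only if $C_{n_i q} \ge 2k$, $1 \le i \le N$. Let $C_{(1)} = \min_{1 \le i \le N} \{C_{n_i q}\}$. Clearly
Corollary 4.1 holds simultaneously for all $N$ families when $C_{(1)} \ge 2k$. Therefore,
applying Theorem 4.2, the class of all finite mixtures of at most $k$ elements of $\mathcal{F}_N$ is
identifiable if $C_{(1)} \ge 2k$.

□

As in Theorem 4.1, finite mixtures of products of distribution functions that are
identifiable satisfy the following uniqueness of representation property:

$$\sum_{i=1}^{k'} c_i' \prod_{j=1}^{N} M(\mathbf{y}_j; n_j, \boldsymbol{\pi}_{ij}') = \sum_{i=1}^{k''} c_i'' \prod_{j=1}^{N} M(\mathbf{y}_j; n_j, \boldsymbol{\pi}_{ij}'') \tag{4.30}$$

implies

$$k' = k'', \quad (\boldsymbol{\pi}_{i1}', \ldots, \boldsymbol{\pi}_{iN}') = (\boldsymbol{\pi}_{l_i 1}'', \ldots, \boldsymbol{\pi}_{l_i N}'')$$

for some permutation $(l_1, \ldots, l_k)$ of $(1, \ldots, k)$. The equality in (4.30) holds for all
$\mathbf{y}_N^* = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$ where $\mathbf{y}_j \in \mathcal{K}_{n_j}$ is applied to the $j$th multinomial in (4.30),
$1 \le j \le N$.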

4.3.2  Mixtures  of  Multinomial  Regression  Models 

Using  results  of  the  previous  section,  we  now  consider  the  identifiability  of  finite 
mixtures  of  overdispersed  multinomial  logit  regression  models.  There  has  been  rel- 
atively little  research  on  the  identifiability  of  mixtures  of  logistic  regression  models 
for  binary  responses.  Wood  and  Hinde  (1987)  mention  that  such  models  are  identifi- 
able if  at  least  one  regression  variable  is  continuous  and  are  identifiable  only  in  rare 
circumstances  when  the  covariates  are  all  discrete.  They  do  not,  however,  provide 
conditions  under  which  particular  models  would  be  identifiable.  For  finite  mixtures  of 
logistic  regression  models  with  random  intercepts,  Follmann  and  Lambert  (1991)  pro- 
vide sufficient  conditions  that  ensure  that  the  regression  parameters  and  the  mixing 
distribution  of  the  intercept  are  identifiable.  For  mixtures  of  binomials  with  sample 
sizes  greater  than  one,  they  apply  the  results  of  Teicher  (1963)  to  bound  the  number 
of  components  in  the  mixing  distribution.  For  mixtures  of  Bernoulli  distributions, 
they  estimate  the  mixing  distribution  from  distributions  with  similar  covariate  val- 
ues. In  this  case,  the  bound  on  the  number  of  components  of  the  mixing  distribution 
depends  on  the  number  of  covariate  vectors  that  agree  on  all  coordinates  except  for 
one.  For  both  sets  of  conditions,  the  logistic  regression  models  that  they  consider  al- 
low for  mixtures  of  one  binomial  or  one  Bernoulli  distribution.  More  recently  Butler 
and  Louis  (1997)  considered  a latent  linear  model  for  binary  data  having  the  struc- 
ture of  (1.1)  with  any  class  of  mixing  distribution.  They  give  sufficient  conditions  for 
identifiability  of  the  fixed  effects  and  mixing  distribution  as  well  as  for  convergence 
of their maximum likelihood estimators. Though their results are general and can
be applied to a wide range of models, the conditions for assuring identifiability are
quite esoteric and do not provide information as to the number of mass points that
are identifiable.

We begin by defining identifiability within the context of overdispersed
multinomial logit regression models. As in Section 4.2, we consider models of the form
$\eta_{ij} = Z_{ij}\boldsymbol{\beta} + u_i$ where the random effect $u_i$ is assumed to have a discrete
distribution $G$ with finite masses $p_1 > 0, \ldots, p_K > 0$, $\sum_{k=1}^{K} p_k = 1$, and mass points
$m_1 < \cdots < m_K$. For overdispersed multinomial logit models, the number of
observations per cluster is one ($T_i = 1$, $i = 1, \ldots, n$), thus we drop the $j$ subscript
for the remainder of this section. We assume that conditional on the random effect
$u_i$, the multinomial observation $\mathbf{y}_i$, with corresponding probability vector $\boldsymbol{\pi}_i$, has
distribution


where $n_i$ denotes the multinomial sample size. The parameter $\gamma$ in the multinomial
distribution can take any of the forms given in Chapter 2. The mixed, unconditional
probability function for $\mathbf{y}_i$ is given by


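As one of the $m_k$ is aliased with one of the thresholds, we assume that the column
corresponding to the first threshold in $Z_i$ has been removed to allow direct estimation
of all $K$ mass points. Let $\mathcal{G}_{K'}$ be the set of all discrete distributions on $(-\infty, \infty)$
with a support size of at most $K'$ points. For given $(\mathbf{z}_i, n_i)$, $i = 1, \ldots, n$, we define
the set of parameters $\{(G, \boldsymbol{\beta}) : G \in \mathcal{G}_{K'},\ \boldsymbol{\beta} \in \mathbb{R}^p\}$ to be identifiable if

$$\sum_{k=1}^{K} p_k f(\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\beta}, m_k, n_i) = \sum_{k=1}^{K^*} p_k^* f(\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\beta}^*, m_k^*, n_i), \quad i = 1, \ldots, n, \tag{4.32}$$

implies that

$$(G, \boldsymbol{\beta}) = (G^*, \boldsymbol{\beta}^*) \quad \text{for } G, G^* \in \mathcal{G}_{K'} \text{ and } \boldsymbol{\beta}, \boldsymbol{\beta}^* \in \mathbb{R}^p,$$

where $G$ and $G^*$ have support size $K \le K'$ and $K^* \le K'$, respectively.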

Using  Theorem  4.3  we  now  give  a sufficient  condition  for  the  identifiability  of 
model  (4.31)  by  bounding  the  support  size  K.  This  generalizes  Theorem  1 of  Follmann 
and  Lambert  (1991)  which  provides  conditions  for  the  identifiability  of  binary  logistic 
regression  models  where  the  mixing  is  over  a single  binomial  distribution.  Loosely 
stated,  we  use  the  cluster  with  the  largest  multinomial  sample  size  to  identify  the 
mixing  distribution,  while  the  remaining  clusters  are  used  to  identify  the  regression 
parameters. 

Theorem  4.4 

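Consider the set of finite mixed multinomial regression models defined in equation
(4.31) where $\mathbf{z}_{ir} \in \mathbb{R}^p$ for $r = 1, \ldots, q$, $i = 1, \ldots, n$, and $Z_i = [\mathbf{z}_{ir}]$. Let the index $I$
be such that $C_I = \max\{C_{n_1 q}, \ldots, C_{n_n q}\}$ where $C_{n_i q}$ is defined as in (4.20). If each
set of vectors $\{\mathbf{z}_{1r} - \mathbf{z}_{Ir}, \ldots, \mathbf{z}_{I-1,r} - \mathbf{z}_{Ir}, \mathbf{z}_{I+1,r} - \mathbf{z}_{Ir}, \ldots, \mathbf{z}_{nr} - \mathbf{z}_{Ir}\}$, $r = 1, \ldots, q$,
spans $\mathbb{R}^p$ then $\{(G, \boldsymbol{\beta}) : G \in \mathcal{G}_{K'},\ \boldsymbol{\beta} \in \mathbb{R}^p\}$ is identifiable for $K' \le \tfrac{1}{2} C_I$.

Proof

Consider distributions $G$ and $G^*$ in $\mathcal{G}_{K'}$ that satisfy equation (4.32). Then the
mixed multinomial distribution with parameters $(n_I, \boldsymbol{\pi}_I(\boldsymbol{\beta}))$ and mixing distribution
$G$, and the mixed multinomial distribution with parameters $(n_I, \boldsymbol{\pi}_I(\boldsymbol{\beta}^*))$ and
mixing distribution $G^*$ are identical. Since finite mixtures of multinomials without
covariates are identifiable if $K \le K'$ (Corollary 4.1), $G = G^*$ and
$\pi_{Ir}(m_k + \mathbf{z}_{Ir}'\boldsymbol{\beta}) = \pi_{Ir}(m_k^* + \mathbf{z}_{Ir}'\boldsymbol{\beta}^*)$ for $r = 1, \ldots, q$ and $k = 1, \ldots, K'$. Because $\pi(\cdot)$ is monotone,

$$m_k^* = m_k + \mathbf{z}_{Ir}'\boldsymbol{\beta} - \mathbf{z}_{Ir}'\boldsymbol{\beta}^*, \quad \text{for } r = 1, \ldots, q \text{ and } k = 1, \ldots, K'.$$

Since equality of distributions implies equality of means,

$$n_i \sum_{k=1}^{K'} p_k \pi_{ir}(m_k + \mathbf{z}_{ir}'\boldsymbol{\beta}) = n_i \sum_{k=1}^{K'} p_k \pi_{ir}(m_k + \mathbf{z}_{Ir}'(\boldsymbol{\beta} - \boldsymbol{\beta}^*) + \mathbf{z}_{ir}'\boldsymbol{\beta}^*),$$

$r = 1, \ldots, q$. Again using the monotonicity of $\pi(\cdot)$,

$$m_k + \mathbf{z}_{ir}'\boldsymbol{\beta} = m_k + \mathbf{z}_{Ir}'(\boldsymbol{\beta} - \boldsymbol{\beta}^*) + \mathbf{z}_{ir}'\boldsymbol{\beta}^*$$

$$0 = \mathbf{z}_{ir}'(\boldsymbol{\beta} - \boldsymbol{\beta}^*) - \mathbf{z}_{Ir}'(\boldsymbol{\beta} - \boldsymbol{\beta}^*)$$

$$0 = (\mathbf{z}_{ir} - \mathbf{z}_{Ir})'(\boldsymbol{\beta} - \boldsymbol{\beta}^*), \tag{4.33}$$

for $r = 1, \ldots, q$, $i = 1, \ldots, n$. Finally, because of the assumption that
$\{\mathbf{z}_{1r} - \mathbf{z}_{Ir}, \ldots, \mathbf{z}_{I-1,r} - \mathbf{z}_{Ir}, \mathbf{z}_{I+1,r} - \mathbf{z}_{Ir}, \ldots, \mathbf{z}_{nr} - \mathbf{z}_{Ir}\}$, $r = 1, \ldots, q$, spans $\mathbb{R}^p$, condition
(4.33) holds only if $(\boldsymbol{\beta} - \boldsymbol{\beta}^*) = \mathbf{0}$, or $\boldsymbol{\beta} = \boldsymbol{\beta}^*$. Hence, it follows that $m_k = m_k^*$,
$k = 1, \ldots, K'$. □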

In Theorem 4.4, we assumed that a common parameter vector $\boldsymbol{\beta}$ held for all
logits. The theorem still holds, however, if separate parameter vectors $\boldsymbol{\beta}_r$ are allowed
for each logit. As an application of Theorem 4.4, we consider the toxicity dataset
found in Table 3.3. Theorem 4.4 bounds the identifiable support size by the largest
$\tfrac{1}{2} C_{n_i q}$ over all clusters. For the toxicity dataset, the largest multinomial sample
size is 16. Using the definition of $C_{n_i q}$ in (4.20), $C_{16,2} = 153$, so the maximum
identifiable number of components for a multinomial sample size of 16 with three
response probabilities is 76. Thus, one need not worry about identifiability when
fitting the shifted threshold
model  to  this  dataset.  As  noted  before,  Theorem  4.4  can  be  used  when  a single 
multinomial  observation  is  observed  for  each  cluster.  It  is  also  important  to  establish 
similar  conditions  when  more  than  one  multinomial  observation  is  collected  for  each 
cluster.  One  might  be  able  to  accomplish  this  using  Theorem  4.3,  which  establishes 
identifiability  of  products  of  multinomial  distributions.  More  research  is  still  needed 
in  this  area. 
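The two ingredients of Theorem 4.4, the spanning condition on the covariate differences and the bound $K' \le \tfrac{1}{2} C_I$, can both be checked mechanically. The sketch below is illustrative only: the covariate vectors and cluster sizes are made up (they are not the toxicity data), the helper names are hypothetical, and $C_{nq}$ is computed as $\binom{n+q}{q}$, which is assumed to agree with (4.20).

```python
from fractions import Fraction
from math import comb


def rank(rows):
    """Rank of a list of equal-length numeric rows, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def theorem_4_4_check(z, sizes, q):
    """z[i][r] is the covariate vector (length p) for cluster i and logit r.

    Returns (spanning condition holds, largest K' with K' <= C_I / 2).
    """
    big = max(range(len(sizes)), key=lambda i: comb(sizes[i] + q, q))  # index I
    p = len(z[0][0])
    spans = all(
        rank([[a - b for a, b in zip(z[i][r], z[big][r])]
              for i in range(len(z)) if i != big]) == p
        for r in range(q)
    )
    return spans, comb(sizes[big] + q, q) // 2


# Hypothetical example: four clusters, q = 2 logits, p = 2 covariates shared by both logits.
covariates = [(0, 0), (1, 0), (0, 1), (1, 1)]
z = [[v, v] for v in covariates]
sizes = [12, 16, 9, 14]
result = theorem_4_4_check(z, sizes, q=2)
print(result)
```

With a largest cluster size of 16 and three response probabilities the bound is the 76 quoted in the text; if every cluster shared the same covariate vector, the differences would all be zero and the spanning condition would fail.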



4.4  Application 

We  now  apply  the  NPML  algorithm  of  Section  4.2.1  to  the  wine  dataset  considered 
in  the  previous  chapter.  As  before,  we  simply  illustrate  the  NPML  estimation  method 
as  opposed  to  providing  a thorough  analysis.  We  also  assume  that  the  models  are 
identifiable.  For  the  bitterness  of  wine  dataset,  we  apply  the  NPML  algorithm  using 
both  the  cumulative  logit  and  adjacent-category  logit  links  to  fit  the  shifted  threshold 
model. From the analysis of the toxicity data and satisfaction data in the previous
chapter, it was clear that the shifted threshold model was inadequate and that the
varying threshold model was more appropriate. Though we motivated the NPML
algorithm for the shifted threshold model, it can be modified to fit the varying threshold
model as well. We will consider such modifications in Chapter 5.

4.4.1  Cumulative  Logit  Link 

Consider  again  the  data  in  Table  3.2  consisting  of  ratings  of  wines  with  respect 
to  bitterness.  Recall  that  each  of  nine  judges  rated  eight  wines  on  a five-point  scale 
ranging  from  least  to  most  bitter.  Factors  in  the  experiment  included  temperature 
(TE)  of  the  wine  (cold/warm),  whether  there  was  contact  (CO)  with  the  skin  when 
the  grapes  were  crushed  (yes/no),  and  bottle  (BO)  number  (first/second).  As  each 
wine  judge  has  a particular  sensitivity  to  the  bitterness  of  wine,  one  would  expect 
their  responses  to  be  correlated.  We  again  model  the  heterogeneity  among  judges 
by  allowing  the  thresholds  to  be  shifted  for  each  judge.  However,  we  now  assume 
that  the  random  effect  follows  a discrete  mixing  distribution.  We  first  consider  a 
cumulative  logit  model  with  linear  predictor 

$$\eta_{ijr} = \alpha_r + \beta_{TE} x_{ij1} + \beta_{CO} x_{ij2} + \beta_{BO} x_{ij3} + u_i, \tag{4.34}$$

$$r = 1, \ldots, R - 1, \quad j = 1, \ldots, T, \quad i = 1, \ldots, n,$$

where $R = 5$, $T = 8$, and $n = 9$. In (4.34) $\beta_{TE}$, $\beta_{CO}$, and $\beta_{BO}$ are the parameter
coefficients for the temperature, contact, and bottle factors, respectively. As in the
analysis in Section 3.5.1, the factors were coded 1 and -1 to correspond to the original
analysis by Tutz and Hennevogl (1996).
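To make the structure of a linear predictor such as (4.34) concrete, the sketch below turns thresholds, fixed effects, and a discrete mixing distribution into response probabilities. It is a hedged illustration, not the fitting code used here: the direction of the link (logit $P(Y_{ij} \le r \mid u_i) = \eta_{ijr}$) is an assumption, all numerical values are invented, and the function names are hypothetical.

```python
from math import exp


def category_probs(alphas, betas, x, u):
    """Category probabilities assuming logit P(Y <= r | u) = alpha_r + x'beta + u."""
    shift = sum(b * xi for b, xi in zip(betas, x)) + u
    # Cumulative probabilities for r = 1, ..., R - 1, then 1 for the last category.
    cum = [1 / (1 + exp(-(a + shift))) for a in alphas] + [1.0]
    return [cum[0]] + [cum[r] - cum[r - 1] for r in range(1, len(cum))]


def marginal_probs(alphas, betas, x, mass_points, masses):
    """Average the conditional probabilities over a discrete mixing distribution."""
    per_point = [category_probs(alphas, betas, x, m) for m in mass_points]
    return [
        sum(p * probs[r] for p, probs in zip(masses, per_point))
        for r in range(len(alphas) + 1)
    ]


# Invented values: R = 5 categories, three factors coded 1 / -1, three mass points.
alphas = [-2.0, -0.5, 1.0, 2.5]
betas = [1.0, 0.5, 0.1]
x = (1, -1, 1)
marginal = marginal_probs(alphas, betas, x, mass_points=[-1.0, 0.0, 1.0],
                          masses=[0.25, 0.5, 0.25])
print([round(p, 4) for p in marginal])
```

The same two-step pattern (conditional probabilities at each mass point, then a weighted average) is what distinguishes the discrete-mixture likelihood from the quadrature-based normal random effect.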

Using the algorithm discussed in Section 4.2.1, we calculated maximum likelihood
estimates for model (4.34) by successively increasing the support size $K$. Table 4.2
contains the NPML estimates for support sizes of two, three, and four, as well as
the maximum likelihood estimates obtained using 5-point adaptive Gauss-Hermite
(AGH(5)) quadrature where the random effect was assumed to be normal. For the
NPML approach, we fit a no-intercept version of model (4.34) to obtain direct
estimates of the mass points. An estimate of the suppressed threshold, $\alpha_1$, was obtained
from the mean of the mixing distribution

$$\hat{\mu}_K = E[u_i] = \sum_{k=1}^{K} \hat{m}_k \hat{p}_k. \tag{4.35}$$

To obtain the correct estimates of the remaining thresholds, the estimates of $\alpha_2$, $\alpha_3$,
and $\alpha_4$ from the original fit of model (4.34) must be increased by $\hat{\mu}_K$. The estimates
of the standard deviation of the mixing distribution given in Table 4.2 were calculated
using the standard variance formula for discrete random variables

$$\hat{\sigma}_K^2 = V[u_i] = \sum_{k=1}^{K} \hat{m}_k^2 \hat{p}_k - \hat{\mu}_K^2. \tag{4.36}$$

Table 4.2: Nonparametric maximum likelihood analysis of model (4.34) for the wine
bitterness dataset using the cumulative logit link. Estimates were obtained for
support sizes of two, three, and four. The final column contains the estimates using
5-point adaptive Gauss-Hermite (AGH(5)) quadrature under the assumption of a
normal random effect.

                              Nonparametric ML               ML
                         Estimated Support Size (K)
Parameter                  2          3          4         AGH(5)

$\alpha_1$              -3.820     -4.161     -4.161      -4.082
$\alpha_2$              -0.898     -0.952     -0.952      -0.930
$\alpha_3$               1.663      1.840      1.840       1.797
$\alpha_4$               3.498      3.763      3.763       3.657
$\beta_{TE}$             1.468      1.562      1.562       1.536
                        (0.292)    (0.300)    (0.301)     (0.298)
$\beta_{CO}$             0.862      0.938      0.938       0.916
                        (0.251)    (0.262)    (0.256)     (0.256)
$\beta_{BO}$             0.115      0.124      0.124       0.122
                        (0.230)    (0.235)    (0.236)     (0.232)
$\sigma$                 0.934      1.232      1.232       1.145
Mass Point $m_1$        -3.189     -1.522     -1.522
Mass $p_1$               0.687      0.113      0.113
Mass Point $m_2$        -5.203     -4.023     -4.023
Mass $p_2$               0.313      0.676      0.379
Mass Point $m_3$                   -6.007     -4.023
Mass $p_3$                          0.212      0.297
Mass Point $m_4$                              -6.007
Mass $p_4$                                     0.212
Log-likelihood         -82.583    -80.237    -80.237     -81.394

It is clear from Table 4.2 that the NPML estimate of the mixing distribution has
a support size of three. The log-likelihood values for support sizes of two, three,
and four are -82.583, -80.237, and -80.237, respectively, indicating that increasing
the number of mass points from three to four is unnecessary. In addition, mass point
two from the model with three mass points is repeated when K is increased to four.
Thus the estimated mixing distribution has mass points -1.522, -4.023, and -6.007, with
respective masses .113, .676, and .212. By centering the mixing distribution about
its mean, the centered mass points are found to be 2.639, 0.138, and -1.846. Thus
the mixing distribution is somewhat symmetric about zero.
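The mean, standard deviation, and centered mass points quoted for the three-point fit can be recomputed from the reported mass points and masses using the usual moment formulas for a discrete distribution. The sketch below is an illustration rather than the original computation; because the rounded masses sum to 1.001, they are renormalized first, which is an assumption made here.

```python
from math import sqrt

# Three-point NPML estimates as reported for the cumulative logit fit.
mass_points = [-1.522, -4.023, -6.007]
masses = [0.113, 0.676, 0.212]

# The rounded masses sum to 1.001, so rescale them to sum to one (assumption).
total = sum(masses)
p = [w / total for w in masses]

mean = sum(m * w for m, w in zip(mass_points, p))            # mean of the mixing distribution
var = sum(m * m * w for m, w in zip(mass_points, p)) - mean ** 2
sd = sqrt(var)                                               # standard deviation
centered = [m - mean for m in mass_points]                   # mass points centered at the mean

print(f"mean = {mean:.3f}, sd = {sd:.3f}")
print("centered mass points:", [round(c, 3) for c in centered])
```

The results agree with the tabled threshold estimate, standard deviation, and centered mass points up to the rounding of the inputs.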

We  now  compare  the  results  from  the  three  point  NPML  fit  to  the  adaptive 
quadrature  results.  We  can  see  that  the  estimates  for  all  parameters  are  very  similar 
between  the  two  approaches.  The  NPML  parameter  estimates  and  standard  errors 
are  consistently  larger,  but  only  marginally  so.  The  standard  deviation  estimate 
based  on  the  three  point  discrete  distribution  was  1.232,  compared  with  1.145  under 
the  assumption  of  normality.  Statistical  conclusions  regarding  the  significance  of 
the  effect  parameters  would  be  the  same  for  both  approaches.  We  note  that  the 
maximized  log-likelihood  value  for  the  NPML  approach  was  slightly  larger  (-80.237) 
than  the  adaptive  quadrature  approach  (-81.394).  This  usually  occurs  as  the  NPML 
algorithm  has  a greater  number  of  parameters  due  to  the  masses  and  mass  points. 

4.4.2  Adjacent-Category  Logit  Link 

We  likewise  used  the  NPML  algorithm  to  fit  model  (4.34)  with  the  adjacent- 
category  logit  link.  Recall  that  the  adjacent-category  logit  model  provides  odds  ratio 
estimates  for  adjacent  pairs  of  responses.  Thus,  interpretations  are  based  on  a subset 
of  the  response  scale  as  opposed  to  the  entire  response  scale  as  in  the  cumulative  logit 
model.  Parameter  estimates  for  support  sizes  of  two,  three,  and  four  are  found  in 
Table  4.3  along  with  the  adaptive  quadrature  results  reported  in  Section  3.6.1. 

We  again  see  that  only  a three-point  mixing  distribution  is  needed  to  obtain  the 
NPML  estimates.  Note  that  with  a support  size  of  four  we  obtain  a redundancy  in  the 
first  mass  point  (1.381).  Parameter  estimates  and  standard  errors  for  the  three-point 
NPML  results  and  the  adaptive  quadrature  results  are  in  close  agreement.  Slightly 
larger  values  are  obtained  for  the  NPML  approach,  as  was  seen  in  Table  4.2.  Both 
approaches  would  yield  the  same  statistical  conclusions. 


Table 4.3: Nonparametric maximum likelihood analysis of model (4.34) for the wine
bitterness dataset using the adjacent-category logit link. Estimates were obtained
for support sizes of two, three, and four. The final column contains the estimates
using 8-point adaptive Gauss-Hermite (AGH(8)) quadrature under the assumption
of a normal random effect.

                              Nonparametric ML               ML
                         Estimated Support Size (K)
Parameter                  2          3          4         AGH(8)

$\alpha_1$               0.000     -0.016     -0.016      -0.009
$\alpha_2$              -0.947     -1.117     -1.117      -1.043
$\alpha_3$              -1.718     -2.008     -2.008      -1.880
$\alpha_4$              -2.037     -2.415     -2.415      -2.250
$\beta_{TE}$            -1.067     -1.219     -1.219      -1.149
                        (0.246)    (0.275)    (0.275)     (0.268)
$\beta_{CO}$            -0.614     -0.700     -0.700      -0.659
                        (0.192)    (0.210)    (0.210)     (0.203)
$\beta_{BO}$            -0.052     -0.059     -0.059      -0.056
                        (0.161)    (0.172)    (0.172)     (0.167)
$\sigma$                 0.662      0.971      0.971       0.839
Mass Point $m_1$         1.097      1.381      1.381
Mass $p_1$               0.267      0.228      0.001
Mass Point $m_2$        -0.399     -0.145      1.381
Mass $p_2$               0.733      0.661      0.227
Mass Point $m_3$                   -2.110     -0.145
Mass $p_3$                          0.111      0.661
Mass Point $m_4$                              -2.110
Mass $p_4$                                     0.111
Log-likelihood         -81.683    -79.495    -79.495     -80.853


4.5  Simulation  Studies 

Since  the  underlying  distribution  of  the  random  effect  is  typically  unknown,  the 
nonparametric  approach  provides  a potentially  robust  alternative  to  the  models  dis- 
cussed in  Chapter  3.  In  this  section  we  present  two  simulation  studies  in  which  we 
compare  the  NPML  approach  to  that  of  the  ML  approach  considered  in  the  previous 
chapter.  In  the  first  simulation  study  we  compare  the  bias  in  parameter  estimates 
between  the  two  approaches  for  a single  covariate,  random  intercept  cumulative  logit 
model  using  a variety  of  different  random  effects  distributions.  In  the  second  study 
we  examine  the  behavior  of  the  likelihood-ratio  and  Wald  test  statistics  for  testing  a 
fixed effect parameter by comparing the rejection rates between the NPML and ML
approaches. 

4.5.1  Simulation  Study  I 

A number of authors have used simulation studies to explore the behavior of the
NPML and ML approaches. Follmann and Lambert (1989) conducted a small simulation
study to test their conjecture that knowledge of K was unimportant for
estimating standard errors. They considered two overdispersed logistic regression
models, each with a single binary covariate, in which they assumed the mixing
distribution was either a normal distribution or a two-point discrete
distribution. From their simulations they concluded that calculating standard
errors by treating the estimated support size K as the true value did not
adversely affect the standard error estimates. Neuhaus et al. (1992) investigated
the effects of misspecification of the mixing distribution when fitting random
intercept logistic regression models using the ML approach of the previous
chapter. Using the normal, gamma, and t-distribution as mixing distributions,
they measured the bias in the parameter estimates and standard errors when one
assumed the mixing distribution was normal. They concluded that the ML approach
was quite robust to misspecification of the mixing distribution. However,
inferences concerning the mixing distribution were much less robust to model
misspecification.

Butler and Louis (1992) compared parametric and nonparametric approaches for
random effects models in both linear and logistic models. They used simulation to
compare the moderate and large-sample performance of the NPML method with that of
the Gaussian method within the context of random intercept logistic regression
models. For all simulations they considered only the standard normal distribution
for the true mixing distribution. They concluded that the NPML method performed
efficiently when the true mixing distribution was Gaussian. Recently Aitkin and
Alfo (1998) proposed a set of conditional models for binary longitudinal
responses that mix features of random effects models and transition models and
allow for a general association structure between different observations on the
same individual. They presented both a parametric estimation method, based on
Gauss-Hermite quadrature, and an NPML estimation method for fitting the proposed
models. To assess the practical performance of their proposed models, Aitkin and
Alfo (1998) conducted a large simulation study using a first-order Markov chain
model with a single covariate and random effect. In the simulation experiment
they varied the number of subjects analyzed, the number of observations per
subject, and the true parameter values for the two covariates, and they
considered six possible distributions for the single random effect. Though their
main objective was to compare their proposed set of models, Aitkin and Alfo
(1998) noted that misspecification of the number of components of the mixing
distribution generally caused bias in only the autoregressive term of the model
and not in the covariate term.

We conducted a set of simulations to compare the performance of the nonparametric
approach to that of the normal random effects approach for datasets with
clustered ordinal outcomes and varying mixture distributions. For the simulations
we generated clustered ordinal response data according to a cumulative logit
model with linear predictor

$$\eta_{ijr} = \alpha_r + \beta x_{ij} + u_i, \qquad r = 1, \ldots, R - 1, \tag{4.37}$$

where the number of response categories R = 3, i = 1, ..., n, and j = 1, ..., T.
For all simulations, the $\{x_{ij}\}$ in model (4.37) were generated from the
standard normal distribution and the fixed parameter vector was
$(\alpha_1, \alpha_2, \beta) = (-1, 1, 0.5)$. We considered two sets of
simulations: a larger set in which the number of clusters, n, was fixed at 100,
and a smaller set in which n = 1000. In the larger set we varied the number of
observations per cluster, T (4 or 7), and the true distribution of the random
effect $u_i$. We considered five cases for the mixture distribution:
$N(0, \sigma^2 = 0.5)$, Exp(1), $U(-0.5, 0.5)$, discrete with mass points
(-0.5, 0.5) and masses (0.5, 0.5), and no random effect. For each case except the
last, the random effect was standardized to have mean zero and variance 0.5. In
the smaller set of simulations, only a subset of the cluster sizes and mixture
distributions was considered due to the computational burden of using 1000
clusters. Here we fixed the cluster size at seven and considered only the normal
and exponential distributions. The two simulation sets can be summarized as
follows. The numbers inside the table indicate the total number of simulations
run for each algorithm.

                            SET I            SET II
CLUSTER SIZE:             4       7             7
NORMAL                  500     500           100
EXPONENTIAL             500     500           100
UNIFORM                 500     500
DISCRETE                500     500
NO RANDOM EFFECT        500     500
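The data-generating scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear predictor in (4.37) is taken to be the logit of the cumulative probability P(Y_ij <= r | u_i), and the function names, the seed handling, and the use of NumPy are hypothetical choices, not the code used for the reported simulations.

```python
import numpy as np

def draw_random_effects(dist, n, rng, var=0.5):
    """Draw n random intercepts rescaled to mean 0 and variance `var`."""
    if dist == "none":
        return np.zeros(n)
    if dist == "normal":
        u = rng.normal(0.0, 1.0, n)                    # variance 1
    elif dist == "exponential":
        u = rng.exponential(1.0, n) - 1.0              # Exp(1) centred, variance 1
    elif dist == "uniform":
        u = rng.uniform(-0.5, 0.5, n) * np.sqrt(12.0)  # U(-0.5, 0.5) has variance 1/12
    elif dist == "discrete":
        u = rng.choice([-0.5, 0.5], size=n) * 2.0      # two equal masses, variance 1/4
    else:
        raise ValueError(f"unknown distribution: {dist}")
    return np.sqrt(var) * u

def simulate(dist, n=100, T=4, alpha=(-1.0, 1.0), beta=0.5, seed=0):
    """Simulate one dataset of n clusters with T ordinal responses each."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, T))                        # covariate x_ij ~ N(0, 1)
    u = draw_random_effects(dist, n, rng)
    # Assumed link: logit P(Y_ij <= r | u_i) = alpha_r + beta * x_ij + u_i
    eta = np.asarray(alpha)[None, None, :] + (beta * x + u[:, None])[:, :, None]
    cum = 1.0 / (1.0 + np.exp(-eta))
    # Inverse-CDF sampling: category = 1 + number of cutpoints below a uniform draw
    v = rng.uniform(size=(n, T, 1))
    y = 1 + (v > cum).sum(axis=2)
    return x, y, u
```

For example, `simulate("exponential", n=100, T=7)` would give one dataset for the exponential case with cluster size seven; repeating the call with different seeds gives the replicates for that cell of the design.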



To assess the performance of the ML and NPML approaches, Monte Carlo estimates of
the bias in parameter estimates for each approach were calculated, where the bias
for an estimator $\hat{\theta}$ of $\theta$ is defined as
Bias = $E(\hat{\theta}) - \theta$. For n = 100 a pilot study was used to
determine that 500 simulations were needed to yield Monte Carlo standard error
estimates of less than 0.01. Thus, using 500 simulations at each of the ten
combinations of cluster size and mixture distribution, estimates of the bias of
the fixed parameters $(\alpha_1, \alpha_2, \beta)$ and the variance of the mixing
distribution $\sigma^2$ were calculated by fitting model (4.37) using the NPML
approach and the ML approach of the previous chapter. For the simulation set
using 1000 clusters, only 100 simulations were run for the normal and exponential
distributions. As in Follmann and Lambert (1989), average standard error
estimates for $\beta$ were calculated from both the observed information matrix
at each model fit and from the Monte Carlo estimate across all simulations.
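The summaries described in this paragraph can be sketched as follows. The function name and the artificial inputs are hypothetical, and interpreting the "Monte Carlo standard error" of a bias estimate as the standard deviation of the estimates divided by the square root of the number of simulations is an assumption, since the text does not spell out the formula.

```python
import numpy as np

def summarize_simulations(estimates, model_se, truth):
    """Monte Carlo summaries for one parameter across simulated datasets."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(model_se, dtype=float)
    sd_mc = est.std(ddof=1)                  # Monte Carlo SD of the estimates
    return {
        "bias": est.mean() - truth,          # estimate of E(theta_hat) - theta
        "se_obs": se.mean(),                 # average information-based SE
        "se_mc": sd_mc,                      # Monte Carlo SE of the estimator
        "mc_error_of_bias": sd_mc / np.sqrt(est.size),
    }

# Artificial inputs only: 500 pretend estimates of beta = 0.5 with spread 0.1
rng = np.random.default_rng(0)
fake_beta_hat = rng.normal(loc=0.5, scale=0.1, size=500)
fake_se = np.full(500, 0.1)
summary = summarize_simulations(fake_beta_hat, fake_se, truth=0.5)
```

With an estimator whose standard deviation is about 0.1, 500 replicates give a Monte Carlo error of the bias of roughly 0.1 / sqrt(500), which is below the 0.01 target mentioned in the text.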

For the ML approach, we used 15-point adaptive Gauss-Hermite quadrature to
directly maximize the log-likelihood. Starting values for the fixed effect
parameters were obtained by fitting the fixed effects logistic regression model,
while a starting value of 0.75 was used for $\sigma^2$. For each model fit, the
standard errors of the parameters were calculated by evaluating the analytical
observed information matrix. For the NPML approach, the ECME-BFGS algorithm was
used to obtain estimates of the fixed effects and mixing distribution. The
estimates of the mass points and masses were then used to estimate $\sigma^2$.
At each simulation, the support size K had to be estimated as well. We
accomplished this by overfitting the number of mass points (Aitkin 1996). For
each combination we began with K fixed at seven and successively reduced K until
convergence was obtained. To determine convergence we needed to define some
stopping rules. Upon convergence at a given K, mass points that were less than
0.0001 apart were combined and K was reduced accordingly. Also, if any masses
were less than 0.0001, they and their corresponding mass points were removed and
K was adjusted accordingly. If at K no further points could be removed, an
additional run was made at K - 1 and the log-likelihoods were compared to ensure
that convergence had been met. Upon convergence, standard errors were calculated
from the analytical observed information matrix. Initial estimates of the fixed
parameters
were obtained in the same manner as in the ML approach. The initial estimates of
the mass points were obtained from
$\{\mathrm{logit}(v/(K+1)),\ v = 1, \ldots, K\}$, while the masses were set at
$1/K$.
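The two support-reduction rules in the paragraph above (combine mass points closer than 0.0001, drop masses below 0.0001) can be sketched as a single pruning step. The function name, the mass-weighted location of a merged point, and the renormalization of the remaining masses are assumptions made for illustration; in the procedure described, the model would presumably be refit at the reduced K rather than simply renormalized.

```python
import numpy as np

def prune_support(points, masses, tol=1e-4):
    """Apply the two stopping rules once and return the reduced support."""
    points = np.asarray(points, dtype=float)
    masses = np.asarray(masses, dtype=float)

    # Rule 2: remove mass points whose mass is below the tolerance
    keep = masses >= tol
    points, masses = points[keep], masses[keep]

    # Rule 1: combine neighbouring mass points that are less than tol apart
    order = np.argsort(points)
    merged_pts, merged_ms = [], []
    for pt, ms in zip(points[order], masses[order]):
        if merged_pts and abs(pt - merged_pts[-1]) < tol:
            total = merged_ms[-1] + ms
            # assumed convention: merged location is the mass-weighted average
            merged_pts[-1] = (merged_pts[-1] * merged_ms[-1] + pt * ms) / total
            merged_ms[-1] = total
        else:
            merged_pts.append(pt)
            merged_ms.append(ms)

    merged_ms = np.array(merged_ms)
    return np.array(merged_pts), merged_ms / merged_ms.sum()

# Illustrative four-point fit in which two mass points coincide
pts, ms = prune_support([1.381, 1.381, -0.145, -2.110],
                        [0.001, 0.227, 0.661, 0.111])
```

In this illustration the two coincident points at 1.381 collapse into one, so a nominal K = 4 fit is reported as K = 3.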

For some simulated datasets, the NPML algorithm attempted to converge to mass
points at plus or minus infinity. In such instances the mass point estimate
continues to grow toward plus or minus infinity with little increase in the
overall log-likelihood value. Thus the algorithm would signal convergence with
the offending mass point having an extremely large absolute value. Since reducing
the support size yielded a smaller log-likelihood, we did not replace these
simulations. However, such extreme estimates usually produced problems when
calculating the observed information matrix. Thus we report two sets of tables,
with and without the problem simulations.
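The drift of a mass point toward infinity can be illustrated with a small numerical sketch. It assumes the cumulative logit form logit P(Y <= r | u) = alpha_r + beta * x + u and looks only at the conditional log-likelihood contribution of one hypothetical cluster whose responses all fall in the lowest category; the covariate values and the grid of mass point locations are made up for illustration.

```python
import numpy as np

def loglik_all_lowest(u, x, alpha1=-1.0, beta=0.5):
    """Conditional log-likelihood of a cluster with every response in category 1."""
    z = alpha1 + beta * np.asarray(x, dtype=float) + u
    # log expit(z), written in a numerically stable form
    return float(np.sum(-np.logaddexp(0.0, -z)))

x = np.array([-0.3, 0.1, 0.8, -1.2])           # hypothetical covariates, T = 4
grid = np.array([0.0, 2.0, 5.0, 10.0, 20.0])   # candidate mass point locations
ll = np.array([loglik_all_lowest(u, x) for u in grid])
gains = np.diff(ll)                            # improvement from moving the point out
```

The contribution keeps increasing as the mass point moves outward, but each further step buys less, which matches the description of an estimate that grows without bound while the log-likelihood barely changes.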

The results of the first set of simulations are found in Tables 4.4 - 4.22,
grouped by random effects distribution and cluster size. As found by Neuhaus
et al. (1992), there was little estimated bias in the estimation of the
regression coefficient $\beta$ for the parametric maximum likelihood approach.
This was true even when the mixing distribution was extremely skewed (i.e., the
exponential distribution, Tables 4.12 and 4.14). The largest estimated biases in
$\beta$, 0.013 and 0.012, occurred for cluster sizes of four for the uniform
random effect (Table 4.10) and for no random effect (Table 4.21), respectively,
and these could be explained by Monte Carlo error. These correspond to percent
biases (Bias/True Value) of approximately 2.5%, which is similar to the percent
biases reported by Neuhaus et al. (1992). In general, the simulations with
cluster size four tended to exhibit larger bias in $\beta$ than the corresponding
simulations with cluster size seven. Across all simulations there was strong
agreement between the simulation and model-based estimates of the standard error
for $\beta$. This suggests that valid variance estimates of the estimated
covariate effects can be obtained even under misspecification.

In contrast to the estimation of $\beta$, the estimates for the threshold
parameters in the parametric approach were influenced by the skewness of the
mixing distribution. The largest estimated biases occurred for the exponential
distribution, regardless of the cluster size (Tables 4.12 - 4.15), and were on
the order of 2%. The simulations for the remaining distributions exhibited
percent biases of 0.5% to 1%. As expected, the parametric approach provided very
accurate estimates of the variance component $\sigma^2$ with a normal random
effect (Tables 4.4 and 4.6). There was considerable bias in $\sigma^2$ for the
remaining distributions, however. In general, larger estimated biases were found
for the smaller cluster size. For example, absolute percent biases of about 12%
were found for the exponential and no random effect cases with cluster sizes of
four (Tables 4.14 and 4.21). As noted before, the estimated standard errors from
the observed information matrix for $\beta$ still agreed well with the Monte
Carlo estimates even when $\sigma^2$ was not well estimated.

We now turn our attention to the simulation results for the nonparametric
approach. When averaged across all simulations, the estimated bias in the
regression parameter $\beta$ in the nonparametric approach was very similar to
that of the parametric approach. It is also evident that the estimation of
$\beta$ was not adversely affected by mass points at plus or minus infinity.
Consider Tables 4.14 and 4.15, which contain the results for the exponential
distribution with cluster size of four. A large number of the simulated datasets
(115) resulted in a mass point at plus or minus infinity. However, the estimated
biases in $\beta$ with and without these simulations were only 0.006 and 0.005,
respectively, approximately the level of the Monte Carlo error. The nonparametric
approach exhibited the largest estimated biases with the uniform distribution and
cluster size of four (percent bias of 3.6%), and with no random effect and a
cluster size of four (percent bias of 2.8%). Thus, the parametric and
nonparametric approaches behaved similarly with regard to the estimation of
$\beta$.
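The percent biases quoted for $\beta$ follow from dividing the estimated bias by the true value 0.5. The short sketch below redoes that arithmetic for the cluster-size-four cases using the bias values reported in Tables 4.10 and 4.21; the function and dictionary names are hypothetical.

```python
def percent_bias(bias, true_value):
    """Percent bias, defined as 100 * Bias / True Value."""
    return 100.0 * bias / true_value

TRUE_BETA = 0.5

# Estimated biases in beta for cluster size four (Tables 4.10 and 4.21)
reported_bias = {
    ("ML", "uniform"): 0.013,
    ("ML", "no random effect"): 0.012,
    ("NPML", "uniform"): 0.018,
    ("NPML", "no random effect"): 0.014,
}

percent = {key: percent_bias(b, TRUE_BETA) for key, b in reported_bias.items()}
```

The ML values come out near the "approximately 2.5%" quoted in the text, and the NPML values reproduce the 3.6% and 2.8% figures.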

Examining the standard errors for $\beta$, we see that there is close agreement
between the Monte Carlo estimates and the model-based estimates for the
nonparametric approach as well. In addition, the standard errors from the
nonparametric approach are very similar to those obtained from the parametric
approach. We see that even when the simulations with mass points at plus or minus
infinity are included, the model-based standard errors still perform well. When
broken down by the estimated support size K, the standard errors tend to increase
as the support size increases. This was also seen by Follmann and Lambert (1989).
Comparing the results obtained for the normal random effect (Tables 4.4 - 4.7)
and the 2-point discrete random effect (Tables 4.16 - 4.19), we see that being
far from the true support size of the random effects distribution does not
adversely affect the standard error estimates. This further supports the claim by
Follmann and Lambert (1989) that knowledge of K is unimportant for estimating the
standard errors.

Estimation of the remaining parameters, namely the threshold parameters and the
variance component, is considerably less accurate for the nonparametric approach.
The reason for this is that these parameters are functions of the estimated mass
points. Thus, when the mass points tend to plus or minus infinity, the estimates
for these parameters are greatly affected. Even when the offending simulations
are removed, the estimates for the variance component are generally not very
good. For most of the distributions, these estimates had percent biases on the
order of 5% to 10%. However, they were nearly 20% for the 2-point discrete
distribution (Table 4.19) and with no random effect (Table 4.22), each having
cluster sizes of four. Thus, as has been cautioned by others, one should not
place too much faith in the parameter estimates that are functions of the mixing
distribution parameters.
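To see why the variance component is so sensitive, recall that under the nonparametric approach it is computed from the estimated mass points and masses. The sketch below assumes the usual variance of a discrete distribution and uses an illustrative three-point mixing distribution; the function name and the "drifted" mass point at -25 are hypothetical.

```python
import numpy as np

def mixing_variance(points, masses):
    """Variance of a discrete mixing distribution with the given support."""
    points = np.asarray(points, dtype=float)
    masses = np.asarray(masses, dtype=float)
    mean = np.sum(masses * points)
    return float(np.sum(masses * (points - mean) ** 2))

masses = [0.228, 0.661, 0.111]

# Illustrative three-point fit
var_fit = mixing_variance([1.381, -0.145, -2.110], masses)
sd_fit = np.sqrt(var_fit)

# Same masses, but with the lowest mass point drifting toward minus infinity
var_drift = mixing_variance([1.381, -0.145, -25.0], masses)
```

Moving a single mass point far out inflates the variance estimate by more than an order of magnitude here, even though the masses are unchanged, which is consistent with the very large $\sigma^2$ biases seen in the tables that include the problem simulations.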




Table 4.4: Estimated bias of parameter estimates using NPML and ML estimation
with a normal random effect, 100 clusters, and cluster size of seven. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                             0.006    -0.001    -0.016    -0.149    -0.003     0.005
$\alpha_2$                            -0.003     0.009     0.031    -0.063     0.008     0.009
$\beta$                               -0.007     0.006    -0.022    -0.013     0.000    -0.002
$\mathrm{SE}(\hat{\beta})_{O}$       (0.076)   (0.077)   (0.076)   (0.077)   (0.076)   (0.077)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.070)   (0.080)   (0.074)   (0.062)   (0.078)   (0.077)
$\sigma^2$                            -0.084     0.213    1.3798     2.415     0.309    -0.009
Runs                                     111       326        56         7       500       500

Examining the occurrences of mass points at plus or minus infinity, we see that
they are more likely to occur with the cluster size of four. This was to be
expected since these occurrences are associated with clusters having all the same
responses, which will occur more often when the cluster sizes are small. We also
note that the largest estimated support size K, across all of the simulations,
was five. Even for the continuous random effects distributions, the support size
remained between two and five. For the 2-point discrete distribution, the
majority of the simulations had estimated support sizes of two; however, there
were some that had estimates as high as four. Likewise, estimates of three and
four were obtained when no random effect was used.
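The link between small clusters and identical responses can be made concrete with a short calculation. The sketch assumes the cumulative logit form logit P(Y <= r | u) = alpha_r + beta * x + u with the simulation parameters (-1, 1, 0.5) and, as a simplification, a common covariate value for every observation in the cluster; the function name is hypothetical.

```python
import numpy as np

def prob_all_same(T, u=0.0, x=0.0, alpha=(-1.0, 1.0), beta=0.5):
    """P(all T responses in a cluster share one category | u), common x."""
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(alpha) + beta * x + u)))
    cat = np.diff(np.concatenate(([0.0], cum, [1.0])))  # category probabilities
    return float(np.sum(cat ** T))

# Compare cluster sizes four and seven at a few random-effect values
comparison = {u: (prob_all_same(4, u), prob_all_same(7, u))
              for u in (0.0, 1.0, 2.0)}
```

At each value of the random effect the probability of an all-identical cluster is larger for T = 4 than for T = 7, and it rises as the random effect moves away from zero.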

The results of the second simulation set using 1,000 clusters of size seven are
found in Table 4.23 for the normal random effect and Table 4.24 for the
exponential random effect. Similar patterns are seen in these results, but with
generally smaller estimated biases. It is interesting to note in Table 4.24 that
the bias estimate for the variance component under the parametric approach did
not improve when compared with the simulations using 100 clusters (Table 4.12).
In contrast, the nonparametric estimated bias improved from 0.076 (Table 4.13,
which excludes simulations with a mass point at plus or minus infinity) to 0.025.
We also see that increasing the number




Table 4.5: Estimated bias of parameter estimates using NPML and ML estimation
with a normal random effect, 100 clusters, and cluster size of seven, excluding
simulations where a mass point was located at plus or minus infinity. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                             0.006    -0.004    -0.009    -0.049    -0.002     0.005
$\alpha_2$                            -0.003     0.014     0.035    -0.062     0.013     0.009
$\beta$                               -0.007     0.005    -0.017    -0.022     0.000    -0.002
$\mathrm{SE}(\hat{\beta})_{O}$       (0.076)   (0.077)   (0.077)   (0.076)   (0.077)   (0.077)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.070)   (0.080)   (0.071)   (0.074)   (0.077)   (0.077)
$\sigma^2$                            -0.084     0.033     0.164     0.133     0.077    -0.009
Runs                                     111       321        50         5       487       500

Table 4.6: Estimated bias of parameter estimates using NPML and ML estimation
with a normal random effect, 100 clusters, and cluster size of four. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                             0.099    -0.004    -0.021    -0.089     0.121    -0.014     0.000
$\alpha_2$                            -0.027     0.016     0.006    -0.034     0.097     0.009     0.010
$\beta$                                0.035     0.000     0.001    -0.014    -0.018     0.000    -0.004
$\mathrm{SE}(\hat{\beta})_{O}$       (0.097)   (0.104)   (0.103)   (0.092)   (0.107)   (0.103)   (0.104)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.066)   (0.105)   (0.093)   (0.130)       (-)   (0.100)   (0.099)
$\sigma^2$                                 -     0.916     5.956    20.301     0.970     3.915     0.007
Runs                                       6       245       230        18         1       500       500



Table 4.7: Estimated bias of parameter estimates using NPML and ML estimation
with a normal random effect, 100 clusters, and cluster size of four, excluding
simulations where a mass point was located at plus or minus infinity. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                             0.099    -0.013     0.000     -0.06     0.121    -0.006     0.000
$\alpha_2$                            -0.027     0.008     0.021     0.089     0.097     0.013     0.010
$\beta$                                0.035    -0.001     0.005    -0.034    -0.018     0.002    -0.004
$\mathrm{SE}(\hat{\beta})_{O}$       (0.097)   (0.104)   (0.106)   (0.110)   (0.107)   (0.104)   (0.104)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.066)   (0.106)   (0.093)   (0.052)       (-)   (0.100)   (0.099)
$\sigma^2$                                 -    -0.032     0.106     0.365     0.107     0.052     0.007
Runs                                       6       243       146         3         1       399       500

Table 4.8: Estimated bias of parameter estimates using NPML and ML estimation
with a uniform random effect, 100 clusters, and cluster size of seven. NPML
results are listed by estimated support size K and averaged across all support
sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote
the standard error of $\hat{\beta}$ computed from the observed information matrix
and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                             0.003     0.070     0.167     0.438     0.056     0.013
$\alpha_2$                             0.009     0.075     0.165     0.530     0.061     0.009
$\beta$                                0.003     0.007     0.000     0.169     0.006     0.003
$\mathrm{SE}(\hat{\beta})_{O}$       (0.079)   (0.079)   (0.076)   (0.083)   (0.079)   (0.079)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.081)   (0.086)   (0.071)       (-)   (0.084)   (0.083)
$\sigma^2$                            -0.030     1.344     4.245    10.202     1.133     0.012
Runs                                     172       285        42         1       500       500



Table 4.9: Estimated bias of parameter estimates using NPML and ML estimation
with a uniform random effect, 100 clusters, and cluster size of seven, excluding
simulations where a mass point was located at plus or minus infinity. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.003     0.012     0.006     0.010     0.013
$\alpha_2$                             0.009     0.022    -0.010     0.016     0.009
$\beta$                                0.003     0.007     0.001     0.005     0.003
$\mathrm{SE}(\hat{\beta})_{O}$       (0.079)   (0.080)   (0.080)   (0.079)   (0.079)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.081)   (0.088)   (0.079)   (0.085)   (0.083)
$\sigma^2$                            -0.030     0.059     0.080     0.031     0.012
Runs                                     172       252        25       449       500

Table 4.10: Estimated bias of parameter estimates using NPML and ML estimation
with a uniform random effect, 100 clusters, and cluster size of four. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.029     0.314     0.228     1.339     0.305     0.013
$\alpha_2$                            -0.230     0.301     0.256     1.369     0.309     0.006
$\beta$                                0.035     0.013     0.025     0.008     0.018     0.013
$\mathrm{SE}(\hat{\beta})_{O}$       (0.099)   (0.107)   (0.108)   (0.111)   (0.107)   (0.108)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.091)   (0.104)   (0.119)   (0.097)   (0.111)   (0.108)
$\sigma^2$                                 -    15.893     6.389   215.338    17.325     0.021
Runs                                       2       269       215        14       500       500



Table 4.11: Estimated bias of parameter estimates using NPML and ML estimation
with a uniform random effect, 100 clusters, and cluster size of four, excluding
simulations where a mass point was located at plus or minus infinity. NPML results
are listed by estimated support size K and averaged across all support sizes.
$\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote the
standard error of $\hat{\beta}$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.029     0.019     0.001     0.126     0.011     0.013
$\alpha_2$                            -0.230     0.009     0.055    -0.069     0.023     0.006
$\beta$                                0.035     0.014     0.035     0.013     0.021     0.013
$\mathrm{SE}(\hat{\beta})_{O}$       (0.099)   (0.107)   (0.110)   (0.113)   (0.108)   (0.108)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.091)   (0.104)   (0.111)   (0.146)   (0.107)   (0.108)
$\sigma^2$                                 -     0.107     0.263     0.480     0.076     0.021
Runs                                       2       261       126         5       394       500

Table 4.12: Estimated bias of parameter estimates using NPML and ML estimation
with an exponential random effect, 100 clusters, and cluster size of seven. NPML
results are listed by estimated support size K and averaged across all support
sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote
the standard error of $\hat{\beta}$ computed from the observed information matrix
and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                            -0.013    -0.022     0.034     0.184     0.243     0.037    -0.019
$\alpha_2$                            -0.070    -0.027     0.049     0.215     0.254     0.048    -0.011
$\beta$                                0.048     0.002     0.007     0.009     0.023     0.006     0.003
$\mathrm{SE}(\hat{\beta})_{O}$       (0.073)   (0.075)   (0.076)   (0.077)   (0.077)   (0.076)   (0.077)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.002)   (0.073)   (0.079)   (0.068)       (-)   (0.076)   (0.076)
$\sigma^2$                                 -    -0.128     1.159    10.056     0.699     1.869    -0.054
Runs                                       2       136       301        60         1       500       500



Table 4.13: Estimated bias of parameter estimates using NPML and ML estimation
with an exponential random effect, 100 clusters, and cluster size of seven,
excluding simulations where a mass point was located at plus or minus infinity.
NPML results are listed by estimated support size K and averaged across all
support sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$
denote the standard error of $\hat{\beta}$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML K=5  NPML All        ML
$\alpha_1$                            -0.013    -0.022    -0.003    -0.003     0.243    -0.008    -0.019
$\alpha_2$                            -0.070    -0.027     0.014     0.050     0.254     0.005    -0.011
$\beta$                                0.048     0.002     0.006     0.016     0.023     0.006     0.003
$\mathrm{SE}(\hat{\beta})_{O}$       (0.073)   (0.075)   (0.076)   (0.078)   (0.077)   (0.076)   (0.077)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.002)   (0.073)   (0.080)   (0.072)       (-)   (0.077)   (0.076)
$\sigma^2$                                 -    -0.128     0.055     0.224     0.699     0.076    -0.054
Runs                                       2       136       276        38         1       453       500

Table 4.14: Estimated bias of parameter estimates using NPML and ML estimation
with an exponential random effect, 100 clusters, and cluster size of four. NPML
results are listed by estimated support size K and averaged across all support
sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$ denote
the standard error of $\hat{\beta}$ computed from the observed information matrix
and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                            -0.038     0.010     0.140     0.195     0.071    -0.023
$\alpha_2$                            -0.120     0.001     0.188     0.313     0.090    -0.017
$\beta$                               -0.039    -0.003     0.017     0.033     0.006     0.000
$\mathrm{SE}(\hat{\beta})_{O}$       (0.095)   (0.101)   (0.100)   (0.095)   (0.101)   (0.104)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.060)   (0.104)   (0.105)   (0.142)   (0.106)   (0.105)
$\sigma^2$                                 -     0.491     7.011    42.637     4.688    -0.060
Runs                                       7       263       213        17       500       500



Table 4.15: Estimated bias of parameter estimates using NPML and ML estimation
with an exponential random effect, 100 clusters, and cluster size of four,
excluding simulations where a mass point was located at plus or minus infinity.
NPML results are listed by estimated support size K and averaged across all
support sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$
denote the standard error of $\hat{\beta}$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                            -0.038    -0.010    -0.034    -0.031    -0.018    -0.023
$\alpha_2$                            -0.120    -0.015     0.034     0.144     0.002    -0.017
$\beta$                               -0.039    -0.002     0.020     0.049     0.005     0.000
$\mathrm{SE}(\hat{\beta})_{O}$       (0.095)   (0.102)   (0.106)   (0.110)   (0.104)   (0.104)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.060)   (0.102)   (0.111)   (0.173)   (0.107)   (0.105)
$\sigma^2$                             0.095    -0.054     0.192     0.269     0.020    -0.060
Runs                                       7       253       116         9       385       500

Table 4.16: Estimated bias of parameter estimates using NPML and ML estimation
with a two-point discrete random effect, 100 clusters, and cluster size of seven.
NPML results are listed by estimated support size K and averaged across all
support sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$
denote the standard error of $\hat{\beta}$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.003    -0.002     0.033     0.002     0.010
$\alpha_2$                             0.011     0.011     0.063     0.014     0.005
$\beta$                               -0.001     0.003     0.028     0.002     0.000
$\mathrm{SE}(\hat{\beta})_{O}$       (0.076)   (0.077)   (0.075)   (0.076)   (0.077)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.074)   (0.074)   (0.077)   (0.074)   (0.075)
$\sigma^2$                            -0.005     1.060     1.211     0.497     0.016
Runs                                     268       207        25       500       500



Table 4.17: Estimated bias of parameter estimates using NPML and ML estimation
with a two-point discrete random effect, 100 clusters, and cluster size of seven,
excluding simulations where a mass point was located at plus or minus infinity.
NPML results are listed by estimated support size K and averaged across all
support sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$
denote the standard error of $\hat{\beta}$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.003    -0.001     0.003     0.001     0.010
$\alpha_2$                             0.011     0.014     0.033     0.013     0.005
$\beta$                               -0.001     0.001     0.020     0.001     0.000
$\mathrm{SE}(\hat{\beta})_{O}$       (0.076)   (0.077)   (0.078)   (0.077)   (0.077)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.074)   (0.074)   (0.075)   (0.074)   (0.075)
$\sigma^2$                            -0.005     0.075     0.142     0.034     0.016
Runs                                     268       200        23       491       500

Table 4.18: Estimated bias of parameter estimates using NPML and ML estimation
with a two-point discrete random effect, 100 clusters, and cluster size of four.
NPML results are listed by estimated support size K and averaged across all
support sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$
denote the standard error of $\hat{\beta}$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.010     0.005    -0.031    -0.088    -0.011     0.003
$\alpha_2$                            -0.016     0.011     0.021    -0.053     0.013     0.005
$\beta$                                0.168     0.012     0.016    -0.019     0.013     0.008
$\mathrm{SE}(\hat{\beta})_{O}$       (0.100)   (0.104)   (0.104)   (0.107)   (0.104)   (0.105)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.217)   (0.103)   (0.109)   (0.127)   (0.107)   (0.104)
$\sigma^2$                                 -     0.022     3.999    11.305     1.783     0.041
Runs                                       2       302       182        14       500       500



Table 4.19: Estimated bias of parameter estimates using NPML and ML estimation
with a two-point discrete random effect, 100 clusters, and cluster size of four,
excluding simulations where a mass point was located at plus or minus infinity.
NPML results are listed by estimated support size K and averaged across all
support sizes. $\mathrm{SE}(\hat{\beta})_{O}$ and $\mathrm{SE}(\hat{\beta})_{MC}$
denote the standard error of $\hat{\beta}$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations, respectively.

Parameter                           NPML K=1  NPML K=2  NPML K=3  NPML K=4  NPML All        ML
$\alpha_1$                             0.010     0.005    -0.045    -0.098    -0.010     0.003
$\alpha_2$                            -0.016     0.011     0.020    -0.081     0.013     0.005
$\beta$                                0.168     0.012     0.014     0.035     0.013     0.008
$\mathrm{SE}(\hat{\beta})_{O}$       (0.100)   (0.104)   (0.107)   (0.108)   (0.105)   (0.105)
$\mathrm{SE}(\hat{\beta})_{MC}$      (0.217)   (0.103)   (0.106)   (0.197)   (0.106)   (0.104)
$\sigma^2$                                 -     0.022     0.274     0.162     0.094     0.041
Runs                                       2       302       126         5       435       500

Table 4.20: Estimated bias of parameter estimates using NPML and ML estimation
with no random effect, 100 clusters, and cluster size of seven. NPML results are
listed by estimated support size and averaged across all support sizes. $SE(\beta)_O$ and
$SE(\beta)_{MC}$ denote the standard error of $\beta$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations.

                            Nonparametric ML (support size k)              Maximum
                             1         2         3         4       All  Likelihood
$\alpha_1$              -0.002    -0.013    -0.048    -0.029    -0.010      -0.007
$\alpha_2$               0.008     0.018     0.047    -0.025     0.015       0.012
$\beta$                 -0.002     0.003     0.018     0.069     0.001       0.000
($SE(\beta)_O$)        (0.072)   (0.073)   (0.075)   (0.074)   (0.073)     (0.073)
($SE(\beta)_{MC}$)     (0.076)   (0.073)   (0.076)       (-)   (0.075)     (0.075)
$\sigma^2$                   -     0.080     0.145     0.079     0.046       0.029
Runs                       228       248        23         1       500         500



Table 4.21: Estimated bias of parameter estimates using NPML and ML estimation
with no random effect, 100 clusters, and cluster size of four. NPML results are listed
by estimated support size and averaged across all support sizes. $SE(\beta)_O$ and $SE(\beta)_{MC}$
denote the standard error of $\beta$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations.

                       Nonparametric ML (support size k)         Maximum
                             1         2         3       All  Likelihood
$\alpha_1$              -0.010    -0.034    -0.071    -0.024      -0.024
$\alpha_2$              -0.005     0.047     0.106     0.024       0.014
$\beta$                  0.002     0.023     0.062     0.014       0.012
($SE(\beta)_O$)        (0.096)   (0.099)   (0.105)   (0.098)     (0.099)
($SE(\beta)_{MC}$)     (0.106)   (0.105)   (0.100)   (0.106)     (0.106)
$\sigma^2$                   -     1.258     1.090     0.665       0.056
Runs                       234       253        13       500         500

Table 4.22: Estimated bias of parameter estimates using NPML and ML estimation
with no random effect, 100 clusters, and cluster size of four, excluding simulations
where a mass point was located at plus or minus infinity. NPML results are listed by
estimated support size and averaged across all support sizes. $SE(\beta)_O$ and $SE(\beta)_{MC}$
denote the standard error of $\beta$ computed from the observed information matrix and
from the Monte Carlo estimates across all simulations.

                       Nonparametric ML (support size k)         Maximum
                             1         2         3       All  Likelihood
$\alpha_1$              -0.010    -0.046    -0.121    -0.030      -0.024
$\alpha_2$              -0.005     0.040     0.090     0.019       0.014
$\beta$                  0.002     0.025     0.057     0.014       0.012
($SE(\beta)_O$)        (0.096)   (0.100)   (0.105)   (0.098)     (0.099)
($SE(\beta)_{MC}$)     (0.106)   (0.104)   (0.105)   (0.106)     (0.106)
$\sigma^2$                   -     0.176     0.418     0.098       0.056
Runs                       234       224        11       469         500



of  clusters  from  100  to  1,000  did  not  change  the  range  of  the  estimated  support  size. 
Between  the  two  simulations,  only  one  dataset  resulted  in  a support  size  greater  than 
four.  This  adds  further  support  to  the  conjecture  that  the  estimated  support  size  for 
single  random  effect  models  will  generally  be  small. 

We  can  draw  a number  of  conclusions  based  on  the  results  of  these  simulations. 
First,  both  the  parametric  and  nonparametric  approaches  exhibited  similar  biases 
when estimating the regression parameter $\beta$. In addition, the standard errors obtained
for $\beta$ from the observed information matrix were in close agreement with the Monte
Carlo estimates for both approaches. The two approaches also tended to behave similarly
under the various random effects distributions and cluster sizes, both having their largest
estimated biases for cluster sizes of four with the uniform distribution and no random
effect.  For  estimation  of  the  remaining  parameters,  the  parametric  approach  was 
much  more  reliable.  The  nonparametric  approach  did  not  provide  very  accurate 
estimates  of  the  thresholds  or  the  variance  component.  This  is  mainly  due  to  mass 
points  being  estimated  at  plus  or  minus  infinity.  In  general,  the  parametric  approach 
had  small  estimated  bias  in  the  thresholds  and  variance  component,  except  when 
the  random  effects  distribution  was  extremely  skewed.  Thus,  if  one  is  interested  in 
estimation  of  the  thresholds  or  variance  component,  the  parametric  approach  will 
generally provide more accurate estimates. For the estimation of $\beta$, however, both
approaches  will  generally  yield  similar  estimates. 
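
The summaries reported in the simulation tables (estimated bias, the two standard errors, and the number of runs) can be sketched as follows. This is a minimal illustration, assuming that the estimated bias is the mean estimate minus the true value, that the information-based standard error is averaged over runs, and that the Monte Carlo standard error is the sample standard deviation of the estimates; the function name `summarize_estimates` and its arguments are hypothetical.

```python
from statistics import mean, stdev


def summarize_estimates(estimates, std_errors, true_value):
    """Summarize replicate estimates of one parameter from a simulation.

    estimates  : parameter estimate from each simulated dataset
    std_errors : information-based standard error from each dataset
    true_value : parameter value used to generate the data
    """
    return {
        # estimated bias: average estimate minus the generating value
        "bias": mean(estimates) - true_value,
        # average standard error from the observed information matrix
        "se_observed": mean(std_errors),
        # Monte Carlo standard error: spread of the estimates across runs
        "se_mc": stdev(estimates),
        "runs": len(estimates),
    }


# Usage with made-up numbers (not results from the study)
summary = summarize_estimates([1.0, 1.2, 0.8], [0.1, 0.1, 0.1], true_value=1.0)
```

Close agreement between `se_observed` and `se_mc` is the kind of check the conclusions above rely on when comparing the two approaches.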

4.5.2  Simulation  Study  II 

As  noted  in  Section  4.2.2,  the  asymptotic  theory  needed  for  making  inferences 
in  the  NPML  approach  is  still  unknown.  As  a result  of  this,  a number  of  authors 
have relied on the standard techniques of maximum likelihood inference for the nonparametric
models, assuming that these approaches would be approximately correct
(Davies  1987;  Davies  and  Pickles  1987;  Aitkin  1996,  1999).  Davies  (1987)  provided 




Table 4.23: Estimated bias of parameter estimates using NPML and ML estimation
with a normal random effect, 1,000 clusters, and cluster size of seven. NPML results
are listed by estimated support size and averaged across all support sizes. $SE(\beta)_O$
and $SE(\beta)_{MC}$ denote the standard error of $\beta$ computed from the observed information
matrix and from the Monte Carlo estimates across all simulations.

                            Nonparametric ML (support size k)              Maximum
                             2         3         4         5       All  Likelihood
$\alpha_1$              -0.013     0.003     0.011     0.016     0.005       0.003
$\alpha_2$               0.002     0.007     0.001     0.013     0.005       0.004
$\beta$                  0.011     0.000     0.003     0.032     0.001       0.000
($SE(\beta)_O$)        (0.025)   (0.025)   (0.025)   (0.025)   (0.025)     (0.025)
($SE(\beta)_{MC}$)     (0.011)   (0.025)   (0.025)       (-)   (0.025)     (0.023)
$\sigma^2$              -0.091    -0.010    -0.001     0.035    -0.009       0.000
Runs                         2        66        31         1       100         100

Table 4.24: Estimated bias of parameter estimates using NPML and ML estimation
with an exponential random effect, 1,000 clusters, and cluster size of seven. NPML
results are listed by estimated support size and averaged across all support sizes.
$SE(\beta)_O$ and $SE(\beta)_{MC}$ denote the standard error of $\beta$ computed from the observed
information matrix and from the Monte Carlo estimates across all simulations.

                       Nonparametric ML (support size k)         Maximum
                             2         3         4       All  Likelihood
$\alpha_1$              -0.015    -0.002     0.021     0.002      -0.018
$\alpha_2$              -0.022     0.001     0.029     0.006      -0.009
$\beta$                 -0.008    -0.008     0.005    -0.007      -0.003
($SE(\beta)_O$)        (0.025)   (0.025)   (0.018)   (0.025)     (0.025)
($SE(\beta)_{MC}$)     (0.021)   (0.023)   (0.025)   (0.022)     (0.025)
$\sigma^2$              -0.112    -0.021     0.393     0.025      -0.053
Runs                         6        73        21       100         100



some evidence to support this approach in a simulation study comparing the rejection
rates of the likelihood-ratio test between the NPML approach and a parametric
approach. He found that the likelihood-ratio test using the NPML approach had similar
type I error rate and power when compared with the likelihood-ratio test using a
negative  binomial  model.  His  results  have  been  the  main  justification  for  the  use  of 
the  likelihood-ratio  statistic  in  the  NPML  models  (Aitkin  1996,  1999). 

In  this  section  we  report  the  results  of  a simulation  study  that  further  investigates 
the  use  of  standard  inferential  procedures  in  the  NPML  models.  In  contrast  to  Davies 
(1987),  we  examine  the  performance  of  both  the  likelihood-ratio  test  and  the  Wald 
test  for  the  NPML  approach  as  compared  with  the  parametric  approach  of  Chapter 
3.  Specifically,  we  simulated  data  from  a cumulative  logit  model  with  linear  predictor 

    $\eta_{ijr} = \alpha_r + \beta x_{ij} + u_i, \quad r = 1, \ldots, R - 1,$    (4.38)

where the number of response categories, $R$, was three, $\alpha_1 = -1$, $\alpha_2 = .5$, and $x_{ij}$,
$j = 1, \ldots, 7$, $i = 1, \ldots, 100$, was drawn from a standard normal distribution. Two sets
of simulations were run in which we varied the true distribution of the random effect,
$u_i$. For the first simulation set, we assumed that $u_i \sim N(0, \sigma^2 = .8)$, whereas in the
second we assumed that $u_i \sim \mathrm{Exp}(1)$. As in the first simulation study, we scaled the
exponential distribution to have the same mean and variance as the normal distribution.
For each simulation set, we sampled 500 datasets at each of $\beta = (0, .2, .4, .6)$ and
fit both the NPML model and the cumulative logistic-normal model using adaptive
Gauss-Hermite quadrature to approximate the normal integral. Simulation sizes of
500 provided Monte Carlo error estimates for the parameters of less than 0.01. To
calculate the likelihood-ratio test statistic for each run, it was necessary to fit models
with and without $\beta$ to obtain the log-likelihood values under the null and alternative
hypotheses. The Wald statistic was obtained at each run by calculating the ratio of
the square of $\hat{\beta}$ to the variance of $\hat{\beta}$, obtained from the observed information matrix.




For each simulated dataset, the test of the null hypothesis, $H_0: \beta = 0$, was carried
out  for  the  likelihood-ratio  test  and  the  Wald  test.  The  calculation  of  starting  values 
and  the  number  of  quadrature  points  that  were  used  in  the  first  simulation  study 
were  also  applied  here. 
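
The data-generating step of this simulation design can be sketched as below. This is only an illustration under stated assumptions: the cumulative logit is taken as logit P(Y <= r) = eta_r with the linear predictor of (4.38), the exponential effect is centered and rescaled as sqrt(sigma^2) * (E - 1) with E ~ Exp(1) so that it has mean 0 and variance sigma^2, and the function name `simulate_dataset` and its arguments are hypothetical rather than the code used for the study.

```python
import math
import random


def simulate_dataset(beta, n_clusters=100, cluster_size=7,
                     alpha=(-1.0, 0.5), sigma2=0.8,
                     effect="normal", seed=None):
    """Draw one clustered ordinal dataset; returns (cluster, x, y) tuples."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    data = []
    for i in range(n_clusters):
        if effect == "normal":
            u = rng.gauss(0.0, sd)
        elif effect == "exponential":
            # Exp(1) has mean 1 and variance 1; shift and scale to (0, sigma2)
            u = sd * (rng.expovariate(1.0) - 1.0)
        else:
            raise ValueError("effect must be 'normal' or 'exponential'")
        for _ in range(cluster_size):
            x = rng.gauss(0.0, 1.0)          # standard normal covariate
            v = rng.random()                 # uniform draw for inverse-CDF sampling
            y = 1
            for a in alpha:                  # thresholds must be increasing
                eta = a + beta * x + u
                cum_prob = 1.0 / (1.0 + math.exp(-eta))   # assumed P(Y <= r)
                if v > cum_prob:
                    y += 1
            data.append((i, x, y))
    return data


# One dataset under the exponential random effect with beta = 0.4
sample = simulate_dataset(0.4, effect="exponential", seed=1)
```

Repeating the call 500 times for each value of beta, and fitting both models to each dataset, would reproduce the structure of the study; the model fitting itself is not sketched here.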

Table  4.25  contains  the  estimated  type  I error  rates  for  the  Wald  and  likelihood- 
ratio  tests  for  both  estimating  approaches,  under  normal  and  exponential  random 
effects  assumptions.  It  is  clear  from  the  results  that  the  NPML  and  ML  approaches 
are  in  close  agreement  with  respect  to  their  estimated  type  I error  rates.  For  the 
normal  random  effect,  both  tests  for  both  approaches  exhibit  estimated  type  I error 
rates close to the nominal level with no discernible patterns. For the exponential
random effect, both tests tended to have larger estimated type I error rates than the
nominal level. This held regardless of the estimation approach. The Monte Carlo
error for the estimated Wald and likelihood-ratio tests was approximately 0.01. Thus
the  overestimation  of  the  type  I error  rates  for  the  exponential  distribution  may  be 
spurious. 
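
A rough sketch of how the rejection rates and their Monte Carlo error might be computed from the simulated test statistics is given below. It assumes that both statistics are referred to a chi-square distribution with one degree of freedom and that the replicates are independent, so that the error of an estimated rate follows the binomial formula; the function names are illustrative only.

```python
import math
from statistics import NormalDist


def wald_statistic(beta_hat, var_beta_hat):
    """Squared estimate divided by its estimated variance."""
    return beta_hat ** 2 / var_beta_hat


def lrt_statistic(loglik_null, loglik_alt):
    """Twice the gain in log-likelihood from adding beta to the model."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_1_critical(level):
    """Upper critical value of chi-square(1), the square of a standard normal."""
    z = NormalDist().inv_cdf(1.0 - level / 2.0)
    return z * z


def rejection_rate(statistics, level):
    """Proportion of runs rejecting H0, with its binomial Monte Carlo error."""
    critical = chi2_1_critical(level)
    n = len(statistics)
    rate = sum(s > critical for s in statistics) / n
    return rate, math.sqrt(rate * (1.0 - rate) / n)


def binomial_mc_error(rate, n_sims):
    """Monte Carlo standard error of a rejection rate from n_sims runs."""
    return math.sqrt(rate * (1.0 - rate) / n_sims)
```

Under these assumptions, `binomial_mc_error(0.05, 500)` and `binomial_mc_error(0.10, 500)` are both on the order of 0.01, which is the scale against which differences between estimated and nominal rates should be judged.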

Table  4.26  contains  the  estimated  power  for  the  Wald  and  likelihood-ratio  tests 
when the true value of $\beta$ is .2, .4, and .6. Again we see very similar power between the
parametric and nonparametric approaches for both tests. We also see a marginal increase
in  the  power  when  we  move  from  the  normal  random  effect  to  the  exponential  random 
effect,  which  could  be  explained  by  Monte  Carlo  error.  The  close  agreement  found 
in  Tables  4.25  and  4.26  between  the  NPML  and  ML  approaches  suggests  that  use  of 
the  standard  asymptotic  inferential  tools  is  reasonable  for  the  NPML  approach.  We 
can  also  conclude  that  the  Wald  test  can  be  used  in  lieu  of  the  likelihood-ratio  test 
as  they  seem  to  behave  similarly  for  the  NPML  approach. 

As  discussed  before,  the  asymptotic  theory  regarding  the  likelihood-ratio  test  is 
known to break down when testing involves parameters on the boundary of the parameter
space. This can occur with the testing of a fixed parameter using the NPML




Table 4.25: Estimated type I error rates for the Wald and likelihood-ratio (LRT)
tests for testing $H_0: \beta = 0$ in a cumulative logit random intercept model. Models
were fitted using both the NPML and ML approaches with both a normal and an
exponential random effect distribution.

                     Normal        Exponential
LEVEL   TEST        NPML      ML    NPML      ML
0.10    LRT        0.095   0.090   0.121   0.116
        WALD       0.083   0.088   0.119   0.116
0.05    LRT        0.048   0.052   0.063   0.060
        WALD       0.054   0.050   0.061   0.060
0.01    LRT        0.006   0.006   0.016   0.012
        WALD       0.006   0.006   0.016   0.012

Table 4.26: Estimated power for the Wald and likelihood-ratio (LRT) tests for testing
$H_0: \beta = 0$ when the true $\beta = .2, .4, .6$ in a cumulative logit random intercept model.
Models were fitted using both the NPML and ML approaches with both a normal and an
exponential random effect distribution.

                               Normal                        Exponential
                        LRT             WALD             LRT             WALD
$\beta$  LEVEL        NPML      ML    NPML      ML    NPML      ML    NPML      ML
0.2      0.001       0.232   0.226   0.219   0.218   0.253   0.254   0.247   0.244
         0.0001      0.080   0.080   0.075   0.072   0.084   0.078   0.079   0.078
         0.00001     0.024   0.030   0.021   0.022   0.032   0.034   0.023   0.032
0.4      0.001       0.956   0.960   0.954   0.956   0.968   0.964   0.959   0.964
         0.0001      0.862   0.866   0.857   0.866   0.900   0.886   0.886   0.868
         0.00001     0.776   0.774   0.757   0.754   0.751   0.762   0.735   0.726
0.6      0.001       1.000   1.000   0.996   1.000   0.973   1.000   1.000   1.000
         0.0001      1.000   1.000   0.996   1.000   0.955   1.000   1.000   1.000
         0.00001     1.000   1.000   0.996   1.000   0.940   0.998   0.998   0.998



Table 4.27: Estimated type I error rates for the Wald and likelihood-ratio (LRT) tests
for testing $H_0: \beta = 0$ in a cumulative logit random intercept model. These results
include only the simulations that resulted in a different support size between the
null and alternative model for the likelihood-ratio test. The numbers in parentheses
denote the number of simulations for each distribution.

                  Normal (34)    Exponential (35)
LEVEL   TEST        NPML      ML    NPML      ML
0.10    LRT        0.294   0.265   0.200   0.229
        WALD       0.226   0.235   0.171   0.229
0.05    LRT        0.088   0.118   0.086   0.114
        WALD       0.161   0.088   0.086   0.114
0.01    LRT        0.000   0.000   0.029   0.029
        WALD       0.000   0.000   0.029   0.029

algorithm  if  the  support  size  changes  under  the  null  and  alternative  hypotheses.  One 
does  not  have  control  over  when  this  will  occur,  thus  it  is  difficult  to  examine  this  by 
simulation.  However,  we  examined  the  data  obtained  from  the  various  simulations 
and subsetted the results into those tests that did have a differing support size between
hypotheses. We do not attempt to draw any strong conclusions from these data
as the sample sizes are, in general, not that large. The results are given in Tables 4.27
and 4.28. Indeed, in Table 4.27 the sample sizes for the normal and exponential distribution
cases were only 34 and 35; thus it is difficult to draw any firm conclusions. In
general it seems that the likelihood-ratio test for the NPML model is performing similarly
to the Wald test, and similar results seem to be obtained between approaches
as well. In Table 4.28 we also see similar rejection rates both between tests and between
approaches. We note that rejection rates are again seen to be higher for the
exponential  distribution  when  compared  with  the  normal  distribution.  Tentatively 
we  conclude  that  the  likelihood-ratio  test  can  provide  approximate  inferences  even 
when  the  support  sizes  differ  between  the  null  and  alternative  hypotheses. 

The  nonparametric  maximum  likelihood  approach  for  modeling  clustered  nominal 
and  ordinal  data  is  a viable  alternative  to  the  standard  parametric  approach  discussed 




Table 4.28: Estimated power for the Wald and likelihood-ratio (LRT) tests for testing
$H_0: \beta = 0$ when the true $\beta = .2, .4, .6$ in a cumulative logit random intercept model.
These results include only the simulations that resulted in a different support size
between the null and alternative model for the likelihood-ratio test. N denotes the
number of such simulations for each distribution.

                               Normal                        Exponential
                        LRT             WALD             LRT             WALD
$\beta$  LEVEL        NPML      ML    NPML      ML    NPML      ML    NPML      ML
                     (N = 57)                        (N = 69)
0.2      0.001       0.246   0.263   0.246   0.263   0.319   0.377   0.333   0.362
         0.0001      0.088   0.088   0.105   0.088   0.159   0.159   0.152   0.159
         0.00001     0.035   0.035   0.035   0.035   0.029   0.044   0.030   0.044
                     (N = 113)                       (N = 100)
0.4      0.001       0.956   0.956   0.956   0.956   0.990   0.990   0.970   0.990
         0.0001      0.894   0.894   0.894   0.894   0.940   0.930   0.920   0.910
         0.00001     0.850   0.841   0.814   0.814   0.820   0.840   0.800   0.790
                     (N = 150)                       (N = 252)
0.6      0.001       1.000   1.000   1.000   1.000   0.968   1.000   1.000   1.000
         0.0001      1.000   1.000   1.000   1.000   0.948   1.000   1.000   1.000
         0.00001     1.000   1.000   1.000   1.000   0.941   0.996   1.000   0.996



in  Chapter  3.  The  EM  algorithm  proposed  in  Section  4.2  is  relatively  simple,  and, 
coupled with one of the EM accelerators, can be quite fast. Though asymptotic maximum
likelihood theory does not exist for the NPML approach, the simulations in this
section  suggest  that  standard  tests  can  provide  at  least  approximate  inference.  The 
decision of which approach to use will depend on a number of factors. If one is interested
in estimation of the regression coefficients, the first simulation study suggests
that  both  approaches  will  yield  similar  estimates  and  standard  errors.  For  estimation 
of  the  mixing  distribution  and  threshold  parameters,  however,  the  NPML  approach 
can  be  quite  poor  if  a mass  point  is  located  at  plus  or  minus  infinity.  It  is  evident  from 
the  first  simulation  study  that  the  parametric  approach  is  quite  robust  to  departures 
from  normality.  Thus  one  might  consider  using  this  approach  all  the  time.  Indeed, 
we  have  found  few  occurrences  in  which  the  nonparametric  approach  has  provided 
substantially  different  results  from  the  parametric  approach.  Therefore,  we  view  the 
NPML  approach  as  an  additional  tool  for  random  effects  modeling.  It  can  be  used 
as a check for the normality assumption by seeing if changes occur in the parameter
estimates when the random effects distribution is estimated nonparametrically.
Additional  research  in  the  areas  of  asymptotic  theory  for  inferential  procedures  and 
methods  for  testing  the  number  of  mass  points  is  still  needed,  however.  In  addition, 
software  for  implementing  the  NPML  approach  is  also  needed  to  allow  for  practical 
application  of  such  models. 


CHAPTER  5 

METHODS FOR ANALYZING ORDINAL MULTI-CENTER CLINICAL TRIAL DATA

5.1  Introduction 

In  this  chapter  we  consider  data  of  the  form  obtained  from  multi-center  clinical 
trials.  Clinical  trials  typically  involve  a comparison  of  a standard  treatment  versus  a 
new treatment. Often the outcome is some binary response (success/failure). Subjects
are randomly assigned to one of the two treatment groups, and the success rates
of the two treatments are compared to determine which treatment is more efficacious.
An important feature in many clinical trials is that the trials are carried out
at  multiple  sites  or  centers.  Since  randomization  is  within  center,  patients  from  the 
same  center  often  have  similar  treatment  outcomes.  This  may  be  due,  for  example,  to 
unmeasured  variables  such  as  the  ability  of  the  staff  or  the  quality  of  the  equipment 
at  each  center.  Thus,  besides  just  a treatment  effect,  there  may  be  a center  effect 
influencing  the  efficacy  of  the  treatments.  Data  of  this  type  can  also  arise  in  the 
area  of  meta-analysis  where  one  combines  results  from  a number  of  smaller  studies. 
Such  analyses  are  used  to  summarize  conclusions  and  to  better  understand  sources 
of  between-study  variability. 

The  analysis  of  multi-center  clinical  trial  data  has  received  considerable  attention 
in  the  literature.  Indeed,  journals  such  as  Applied  Clinical  Trials  and  Controlled 
Clinical  Trials  are  dedicated  to  this  area  alone.  Much  attention  has  focused  on 
appropriate  ways  of  assessing  the  treatment  effect  while  accounting  for  the  possible 
heterogeneity  in  the  centers  and  possible  treatment-by-center  interaction.  Recent 
emphasis  has  been  on  the  use  of  frequentist  techniques,  such  as  random  effects  models 
(Agresti and Hartzel 1999), and Bayesian techniques (Jones et al. 1998) for modeling
this type of data. The majority of this work, however, has focused on clinical trials
with  binary  responses. 

Table  5.1  contains  part  of  a dataset  from  a multi-center  clinical  trial  conducted 
at  Merck  pharmaceutical  company.  The  table  contains  the  results  for  eight  centers 
from  a double-blind,  parallel-group  clinical  study.  The  purpose  of  the  study  was  to 
compare  a new  drug  to  a placebo  for  treatment  of  asthma.  Patients  were  randomly 
assigned  to  a treatment  and,  at  the  end  of  the  study,  their  change  in  asthma  condition 
was  evaluated.  The  response  scale  for  this  study  was  ordinal  with  possible  responses 
of  much  better,  better,  unchanged  or  worse.  To  correctly  estimate  the  new  drug 
effect  one  must  consider  the  effects  of  the  individual  centers  and  the  possibility  of 
nonconstant  treatment  effects  among  the  centers.  Indeed,  Fleiss  (1986)  wrote  “The 
most  challenging  questions  in  the  analysis  of  the  data  from  a multi-center  trial  are 
how  to  carry  out  the  analysis  when  there  is  a treatment-by-center  interaction,  and, 
prior  to  that,  how  to  ascertain  whether  such  interaction  exists.”  In  this  chapter  we 
consider  the  use  of  the  multinomial  random  effects  models  from  Chapters  3 and  4 for 
the  analysis  of  data  such  as  that  given  in  Table  5.1. 

Before  considering  the  random  effects  models,  we  first  briefly  discuss,  in  Section 
5.2,  fixed  effects  approaches  for  analyzing  ordinal  multi-center  data.  In  Section  5.3 
we  consider  ordinal  random  effects  models  that  allow  for  random  center  effects  as 
well  as  random  treatment-by-center  interactions.  In  most  multi-center  clinical  trials 
the  number  of  centers  is  only  small  to  moderate  in  size.  In  Section  5.4  we  report 
the  results  of  a simulation  study  aimed  at  assessing  the  performance  of  the  random 
interaction  model  when  the  number  of  centers  is  not  large.  To  statistically  determine 
if  the  random  interaction  term  is  needed,  we  propose  a score  test  in  Section  5.5,  based 
on  adaptive  Gauss-Hermite  quadrature,  for  testing  that  the  variance  component  for 
a correlated  random  effect  is  zero.  We  use  simulations  to  assess  the  performance  of 
the  proposed  score  test. 




Table 5.1: Asthma multi-center clinical trial data from Merck pharmaceutical company
comparing a drug to a placebo.

                                      Response
Center  Treatment     Much Better    Better   Unchanged/Worse
1       Drug                   13         7                 6
        Placebo                 1         1                10
2       Drug                    2         5                10
        Placebo                 2         2                 1
3       Drug                   11        23                 7
        Placebo                 2         8                 2
4       Drug                    7        11                 8
        Placebo                 0         3                 2
5       Drug                   15         3                 5
        Placebo                 1         1                 5
6       Drug                   13         5                 5
        Placebo                 4         0                 1
7       Drug                    7         4                13
        Placebo                 1         1                11
8       Drug                   15         9                 2
        Placebo                 3         2                 2



5.2  Fixed  Effects  Approach 

Before  discussing  the  fixed  effects  approaches  for  modeling  ordinal  multi-center 
data,  we  first  consider  the  choice  of  a fixed  or  random  effects  approach.  Indeed,  Senn 
(1998)  labeled  this  issue  as  one  of  the  controversies  in  analyzing  multi-center  trials. 
In  certain  situations,  the  decision  of  a fixed  or  random  effects  approach  is  clear. 
For  example,  if  the  centers  have  been  randomly  sampled  and  one  has  an  interest 
in  estimating  the  center  effects,  a random  effects  approach  would  be  appropriate. 
Though  centers  are  typically  not  chosen  at  random,  Grizzle  (1987)  argued  that  “...the 
assumption  of  random  clinic  effect  will  result  in  tests  and  confidence  intervals  that 
better  capture  the  variability  inherent  in  the  system  more  realistically  than  when 
clinic  effects  are  considered  fixed.”  If,  on  the  other  hand,  there  are  very  few  centers 
or  the  centers  chosen  are  the  only  ones  of  interest,  then  a fixed  effects  analysis  would 
be  appropriate.  Unlike  the  random  effects  analysis,  however,  the  ability  to  estimate 
the  center  effects  in  the  fixed  effects  approach  will  depend  on  the  sparseness  of  the 
dataset. 

Senn  (1998)  listed  a number  of  pros  and  cons  for  using  fixed  or  random  effects. 
In  support  of  the  fixed  approach  he  argued  that  the  “center”  is  not  a well  defined 
experimental  unit  and  that  the  random  effects  model  gives  it  substantive  significance 
that  it  does  not  really  have.  He  also  argued  that  it  is  difficult  to  precisely  define  what 
question the random effects model is answering. On the other hand, he argued that
the  random  effects  approach  allows  one  to  borrow  information  from  all  other  centers 
when  making  inference  on  a particular  center.  It  also  provides  wider  and  hence  more 
realistic  standard  errors  and  confidence  limits  than  the  fixed  effects  analysis.  One 
can  make  strong  arguments  for  either  approach  as  they  each  have  their  place.  As  our 
main  focus  in  this  dissertation  is  the  application  of  random  effects  models,  we  only 
briefly  consider  the  fixed  effects  approach  here. 




5.2.1  Maximum  Likelihood 

For  ordinal  multi-center  data,  such  as  that  given  in  Table  5.1,  the  primary  goal 
is  to  evaluate  the  efficacy  of  the  new  drug.  One  can  accomplish  this  through  the 
use  of  a multinomial  regression  model  for  ordinal  data  which  includes  an  association 
parameter  for  the  treatment  effect.  We  only  consider  multinomial  models  based  on 
the  logit  link;  however,  one  could  use  alternative  links,  such  as  the  probit  link,  as  well. 
As before, let $y_{ij}$ denote the multinomial response for the $j$th treatment in the $i$th
center with multinomial sample size $n_{ij}$, $j = 1, 2$, $i = 1, \ldots, n$. A simple model for
estimating the association between the ordinal response and the treatment covariate
is

    $\eta_{ijr} = \alpha_r + \gamma_i + \beta x_j, \quad r = 1, \ldots, q,$    (5.1)

where $\alpha_r$ denotes the $r$th threshold parameter, $\gamma_i$ denotes the $i$th center effect, and
$x_j$ is coded one for the new drug and zero for the placebo. In order to estimate
the $n$ center effects, model (5.1) requires a constraint such as $\alpha_1 = 0$. Model (5.1)
assumes that a common association $\beta$ holds across all centers. That is, it assumes that a
treatment-by-center interaction does not exist. If one were to fit model (5.1) using
a cumulative logit link, $\exp(\beta)$ would denote the common cumulative odds ratio.
Alternatively, one could use the adjacent-category logit, where $\exp(\beta)$ would denote
the common odds ratio for all adjacent pairs of responses. Note that as the number
of centers increases, the number of parameters in model (5.1) increases as well. Thus
we see that the fixed effects approach is useful only when the number of centers is
not too large, relative to the overall sample size.
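
The interpretation of exp(beta) as a common cumulative odds ratio can be checked numerically with a small sketch. It assumes the cumulative logit form logit P(Y <= r) = alpha_r + gamma_i + beta * x_j for model (5.1); the parameter values and the function names `category_probs` and `cumulative_odds_ratios` are hypothetical and chosen only for illustration.

```python
import math


def _expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probs(alpha, gamma_i, beta, x):
    """Response probabilities for one center and treatment under the assumed model."""
    cum = [_expit(a + gamma_i + beta * x) for a in alpha] + [1.0]
    return [cum[0]] + [cum[r] - cum[r - 1] for r in range(1, len(cum))]


def cumulative_odds_ratios(alpha, gamma_i, beta):
    """Drug-versus-placebo odds ratio of Y <= r at every cut point r."""
    ratios = []
    for a in alpha:
        p_drug = _expit(a + gamma_i + beta * 1)
        p_placebo = _expit(a + gamma_i + beta * 0)
        ratios.append((p_drug / (1 - p_drug)) / (p_placebo / (1 - p_placebo)))
    return ratios


# Hypothetical values: alpha_1 = 0 as the identifying constraint, q = 2 cut points
alpha, gamma_i, beta = (0.0, 1.2), -0.4, 0.9
ratios = cumulative_odds_ratios(alpha, gamma_i, beta)   # each equals exp(beta)
probs = category_probs(alpha, gamma_i, beta, x=1)       # sums to one
```

The ratios do not depend on the cut point or on the center effect, which is what the common-association assumption of model (5.1) amounts to.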

If one suspects that the treatment effect may vary across centers, one can generalize
model (5.1) to allow a separate association parameter for each center. This
heterogeneous association model has the form

    $\eta_{ijr} = \alpha_r + \gamma_i + \beta_i x_j, \quad r = 1, \ldots, q.$    (5.2)

Model  (5.2)  assumes  common  threshold  parameters  across  all  centers,  but  allows  for 
varying  association  parameters.  A disadvantage  of  this  model  is  that  one  does  not 
obtain  a single  measure  for  describing  the  treatment  effect.  We  will  see  in  Section 
5.3  that  the  random  effects  approach  to  model  (5.2)  provides  a mean  treatment  effect 
along  with  an  estimate  of  its  variability. 

5.2.2  Mantel-Haenszel  Approach 

As an alternative to the maximum likelihood estimate of $\beta$ from model (5.1),
Liu and Agresti (1996) proposed a Mantel-Haenszel-type estimator for estimating a
common cumulative odds ratio for several stratified $2 \times R$ tables. For $R = 2$, the
Mantel-Haenszel estimator (Mantel and Haenszel 1959) for a common odds ratio is
consistent under both large-sample and sparse-sample asymptotics; that is, both when
the number of centers is fixed and the sample size within each center becomes large, and
when the number of centers increases proportionally with the overall sample size. In
contrast, the odds ratio estimate obtained from model (5.1) is inconsistent under sparse
asymptotics. Liu and Agresti (1996) showed that the Mantel-Haenszel-type estimator
maintained the asymptotic behavior of the Mantel-Haenszel estimator and had little
efficiency loss compared to the maximum likelihood estimator $\exp(\hat{\beta})$ from model
(5.1) when the data were not sparse.

For the $i$th center with multinomial samples $y_{i1}$ and $y_{i2}$ and multinomial sample
sizes $n_{i1}$ and $n_{i2}$, respectively, the Mantel-Haenszel estimator of the common
cumulative odds ratio is

    $\widehat{OR}_{MH} = \frac{\sum_{i=1}^{n} \sum_{k=1}^{q} y^*_{i1k} (n_{i2} - y^*_{i2k}) / n_{i\cdot}}{\sum_{i=1}^{n} \sum_{k=1}^{q} y^*_{i2k} (n_{i1} - y^*_{i1k}) / n_{i\cdot}},$    (5.3)

where $y^*_{ijk} = y_{ij1} + \cdots + y_{ijk}$ and $n_{i\cdot} = n_{i1} + n_{i2}$. Liu and Agresti (1996) noted
that even when the common cumulative odds ratio assumption does not hold, the
estimator (5.3) provides a useful summary if the heterogeneity in the center-specific
odds ratios is not large and the center-specific odds ratios are in the same direction.
They  also  provide  standard  error  estimates  of  (5.3)  for  when  the  assumption  of  a 
common  cumulative  odds  ratio  holds  and  for  when  it  does  not  hold. 
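
As an illustration of estimator (5.3), the following sketch accumulates the numerator and denominator over centers and cut points. It assumes q = R - 1 cut points, takes the drug as treatment 1 and the placebo as treatment 2, and uses the counts as they appear in Table 5.1; the function name `mh_cumulative_odds_ratio` is hypothetical, and the standard errors of Liu and Agresti (1996) are not computed.

```python
def mh_cumulative_odds_ratio(tables):
    """Mantel-Haenszel-type estimate of a common cumulative odds ratio.

    tables : list of (row1, row2) pairs, one per center, where each row
             holds the R ordered category counts for one treatment.
    """
    num = 0.0
    den = 0.0
    for row1, row2 in tables:
        n1, n2 = sum(row1), sum(row2)
        n_total = n1 + n2
        cum1 = cum2 = 0
        # cut points k = 1, ..., R - 1 (the last category gives no 2 x 2 table)
        for k in range(len(row1) - 1):
            cum1 += row1[k]          # y*_{i1k}
            cum2 += row2[k]          # y*_{i2k}
            num += cum1 * (n2 - cum2) / n_total
            den += cum2 * (n1 - cum1) / n_total
    return num / den


# Counts read from Table 5.1: (drug, placebo) for each of the eight centers,
# ordered as much better, better, unchanged/worse
asthma = [
    ([13, 7, 6], [1, 1, 10]),
    ([2, 5, 10], [2, 2, 1]),
    ([11, 23, 7], [2, 8, 2]),
    ([7, 11, 8], [0, 3, 2]),
    ([15, 3, 5], [1, 1, 5]),
    ([13, 5, 5], [4, 0, 1]),
    ([7, 4, 13], [1, 1, 11]),
    ([15, 9, 2], [3, 2, 2]),
]
estimate = mh_cumulative_odds_ratio(asthma)
```

With a single center and R = 2 the function reduces to the ordinary sample odds ratio, which gives a quick sanity check of the indexing.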

The Mantel-Haenszel estimator (5.3) of the common cumulative odds ratio provides
a computationally simple alternative to the corresponding maximum likelihood
estimator, one that is also consistent under sparse asymptotics. Liu and Agresti
(1996) recommended the use of (5.3) over the maximum likelihood estimator when
the sample sizes for most centers are five or less. In practice, however, the assumption
of a common association across all centers is unrealistic. Thus methods for investigating
and describing such heterogeneity are important. In the next section we
utilize  random  effects  to  allow  for  heterogeneity  in  the  cumulative  odds  ratios  and  to 
describe  the  magnitude  of  variability. 

5.3  Random  Effects  Approach 

Due  to  the  recent  advances  in  computing  power,  greater  availability  of  software, 
and  the  plethora  of  recent  literature  on  the  topic,  random  effects  modeling  is  being 
utilized  in  a wide  variety  of  experimental  situations  that  have  clustered  or  longitudinal 
data.  If  one  assumes  that  centers  in  a multi-center  clinical  trial  are  randomly  selected 
from  a population  of  centers,  then  a random  effects  approach  can  be  used  to  account 
for  heterogeneity  in  the  centers  and  in  the  associations  across  centers.  This  approach 
has  been  examined  by  a number  of  authors  when  the  response  from  the  clinical 
trial is binary (see, e.g., Agresti and Hartzel 1999). Alternative approaches exist for
incorporating heterogeneity (see, e.g., Skene and Wakefield 1990); however, we will
concentrate on the random effects approach.




As  the  research  on  random  effects  models  for  ordinal  data  has  generally  lagged 
behind  that  for  binary  data,  there  has  been  little  work  in  applying  such  models  to 
the  clinical  trial  setting.  Lindsey  et  al.  (1997)  considered  the  analysis  of  a seasonal 
rhinitis  (or,  more  commonly,  hay  fever)  clinical  trial.  Subjects  were  observed  for 
28  days  for  which  they  recorded  a symptom  response  scored  on  a 0-3  scale,  with 
0 being  no  symptom  and  3 being  bad.  The  structure  of  the  data  differs  from  the 
multi-center  data  given  in  Table  5.1,  but  does  exhibit  clustering  at  the  subject  level. 
Lindsey et al. (1997) proposed a continuation-ratio logit model that conditioned on
the subject’s previous response, yielding a form of Markov chain. Such an approach
can  account  for  the  serial  correlation  in  time  for  a subject,  but  does  not  account 
for  heterogeneity  among  the  subjects.  A random  effects  approach  (as  discussed  in 
Chapter  3)  could  be  used  to  account  for  such  heterogeneity.  A complicating  factor 
in  the  dataset,  however,  was  that  it  consisted  of  416  patients  with  approximately 
10,650  total  observations.  Thus,  a random  effects  approach,  which  must  approximate 
integrals  for  each  subject,  would  certainly  be  computationally  burdensome  to  fit. 

In  this  section  we  consider  random  effects  models  for  modeling  ordinal  responses 
from  a multi-center  clinical  trial.  We  begin,  in  Section  5.3.1,  by  considering  the  sim- 
plest random  effects  model  which  allows  a shifting  in  thresholds  for  the  centers.  This 
model  assumes  that  a common  association  parameter  holds  for  all  centers.  We  then 
consider a more realistic model in Section 5.3.2 that allows the association parameter
to vary as well. This heterogeneous association model is computationally more
complicated as it includes a random effect for center as well as a random
center-by-treatment interaction. For both models, parametric and nonparametric assumptions
can be made concerning the distribution of the random effects. For the latter assumption,
we discuss in Section 5.3.3 how one can extend the NPML approach of
Chapter 4 to fit the heterogeneous association model, as well as general multiple
random effects models.




5.3.1  Homogeneous  Association 

In  the  previous  section  we  considered  the  fixed  effects  model  (5.1)  that  included  a 
separate  effect  for  each  center  and  a common  association  parameter  across  all  centers. 
When  the  number  of  centers  becomes  large  or  the  data  become  sparse  within  each 
center,  problems  can  occur  in  the  estimation  of  model  (5.1).  Indeed,  for  binary  data, 
Agresti  and  Hartzel  (1999)  showed  that  infinite  estimates  (in  absolute  value)  for  the 
center  effects  can  occur  in  the  latter  situation.  As  an  alternative  to  model  (5.1),  one 
can  treat  the  centers  as  if  they  were  a random  sample  from  a population  of  centers. 
This  amounts  to  allowing  for  a shifting  of  thresholds  by  center.  Thus  the  linear 
predictor  is  given  by 


$$\eta_{ijr} = \alpha_r + \beta x_j + u_i, \quad r = 1, \ldots, q, \tag{5.4}$$

where $u_i$ denotes the random effect for the $i$th center, $i = 1, \ldots, n$, and the remaining
parameters and covariate are defined as in Section 5.2.1.

The specification of the homogeneous random effects model is completed by specifying
the distribution of the random effect $u_i$ and by choosing a link function. As
discussed  in  Chapters  3 and  4,  one  can  make  a parametric  or  nonparametric  assump- 
tion about  the  distribution  of  the  random  effect.  From  our  experience  and  from  the 
results  of  the  simulation  study  in  Section  4.5.1,  both  approaches  will  yield  similar 
estimates  for  the  association  parameter.  Regardless  of  the  choice,  the  algorithms 
given  in  Chapters  3 and  4 can  be  used  to  obtain  maximum  likelihood  estimates  of 
the  parameters.  The  choice  of  the  link  function  will  depend  on  the  form  of  the  or- 
dinal response  and  the  desired  interpretation.  For  example,  if  the  responses  have  a 
sequential  ordering,  then  the  continuation-ratio  logit  link  would  be  most  appropriate. 
For  the  data  given  in  Table  5.1,  the  cumulative  logit  or  adjacent-category  logit  link 
would  most  likely  be  chosen. 
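To make the role of the random effect in model (5.4) concrete, the following Python sketch converts the linear predictors into category probabilities under a cumulative logit link. It assumes the convention that the logit of P(Y <= r) equals alpha_r + beta*x + u, which should be checked against Section 5.2.1; the function name and the numerical values in the usage note are hypothetical.

```python
import math


def cumulative_logit_probs(alphas, beta, x, u=0.0):
    """Category probabilities when logit P(Y <= r) = alpha_r + beta*x + u."""
    if any(a >= b for a, b in zip(alphas, alphas[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative probabilities for the q = R - 1 cutpoints.
    cumulative = [1.0 / (1.0 + math.exp(-(a + beta * x + u))) for a in alphas]
    cumulative.append(1.0)  # P(Y <= R) = 1
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)  # difference successive cumulative values
        previous = c
    return probs
```

For example, `cumulative_logit_probs((-1.0, 0.5), 0.75, x=1, u=0.3)` returns three probabilities that sum to one. Increasing `u` shifts every threshold by the same amount, which is the sense in which the center effect shifts the thresholds.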




An advantage of using model (5.4) over model (5.1) is that finite estimates
of the center effects can be obtained, even when the data are sparse within the
centers. As discussed in Section 3.4.3, estimates of the center effects in model (5.4)
are obtained by calculating the expected values of the $\{u_i\}$ given the data and the
final parameter estimates. These predictions are analogs of best linear unbiased
predictors (BLUP) for mixed models with normal responses. As the prediction for a
given center effect is a function of the parameter estimates obtained from all centers,
the predictions “borrow” information from all centers. Thus finite estimates
can be obtained for centers for which the estimated center effect would be infinite
under maximum likelihood fitting of model (5.1). Center estimates can be obtained
for both the parametric and nonparametric distributional assumptions. However, due
to the discreteness of the random effects distribution, the set of possible center
estimates for the nonparametric approach is smaller than that of the parametric
approach. Thus, the parametric approach is better suited for obtaining such predictions.
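The prediction of a center effect as a posterior expectation can be illustrated numerically. The Python sketch below assumes a cumulative logit link, a normal random effect with known standard deviation, and fixed parameter values treated as the final estimates. It replaces the quadrature schemes of Chapter 3 with a plain midpoint rule, so it is an illustration of the idea rather than the algorithm used in this work; all names and numbers are hypothetical.

```python
import math


def category_probs(alphas, shift):
    """Cumulative-logit probabilities with logit P(Y <= r) = alpha_r + shift."""
    cumulative = [1.0 / (1.0 + math.exp(-(a + shift))) for a in alphas] + [1.0]
    return [c - p for c, p in zip(cumulative, [0.0] + cumulative[:-1])]


def predict_center_effect(counts_by_trt, x_by_trt, alphas, beta, sigma, grid=2001):
    """Posterior mean of u_i given one center's counts, with u_i ~ N(0, sigma^2)."""
    lower, upper = -8.0 * sigma, 8.0 * sigma
    step = (upper - lower) / grid
    nodes, log_weights = [], []
    for g in range(grid):
        u = lower + (g + 0.5) * step  # midpoint rule node
        log_w = -0.5 * (u / sigma) ** 2  # normal kernel; constants cancel
        for counts, x in zip(counts_by_trt, x_by_trt):
            probs = category_probs(alphas, beta * x + u)
            # Multinomial log-likelihood kernel; the floor avoids log(0).
            log_w += sum(y * math.log(max(p, 1e-300))
                         for y, p in zip(counts, probs) if y > 0)
        nodes.append(u)
        log_weights.append(log_w)
    top = max(log_weights)  # subtract the maximum before exponentiating
    weights = [math.exp(lw - top) for lw in log_weights]
    return sum(u * w for u, w in zip(nodes, weights)) / sum(weights)
```

For a sparse center in which every response falls in the first category, for example `predict_center_effect([[5, 0, 0], [5, 0, 0]], [0, 1], (-1.0, 0.5), 0.75, 1.0)`, the prediction is positive but finite, whereas a fixed center effect would diverge.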

Model (5.4) assumes that the association parameter $\beta$ is the same across all centers.
It also assumes that the distance between thresholds is the same for all centers.
For the continuation-ratio and adjacent-category logit links, one could generalize (5.4)
by replacing $u_i$ with $u_{ir}$, allowing the thresholds to vary individually. For the cumulative
logit link, the extended model of Tutz and Hennevogl (1996) (see Section 3.7)
could also be applied. For all three cases, one no longer has a single estimate for
the center effects. Indeed, for the extended cumulative logit model, the predicted
threshold effects would have little meaning as they do not correspond to the original
thresholds (except for the first threshold). In general, such models would most
likely provide estimates of $\beta$ similar to those provided by model (5.4). A more beneficial
generalization would be to allow the association parameter to vary over centers. We
consider this heterogeneous association model in the next section.




5.3.2  Heterogeneous  Association 

The  assumption  of  a common  association  parameter  for  all  centers  is  unrealistic. 
Indeed  one  would  expect  the  parameter  to  vary  at  least  nominally  due  to  variation  in, 
for  example,  equipment,  personnel,  or  patients  from  center  to  center.  One  can  extend 
the  homogeneous  random  effects  model  to  allow  for  a varying  association  parameter 
by  incorporating  a center-by-treatment  interaction  into  model  (5.4).  Since  the  center 
effects  are  already  assumed  to  be  random,  the  resulting  interaction  is  also  random. 
The  heterogeneous  random  effects  model  can  be  written  in  the  form 

$$\eta_{ijr} = \alpha_r + \beta x_j + u_i + v_{ij}, \quad r = 1, \ldots, q, \quad j = 1, 2, \tag{5.5}$$

where the complete random effects vector for the $i$th center is $\mathbf{u}_i' = (u_i, v_{i1}, v_{i2})$,
$i = 1, \ldots, n$. We assume that $\mathbf{u}_i$ follows a distribution $G$ with mean $\mathbf{0}$ and covariance
matrix $\Sigma$.

The  form  of  the  heterogeneous  random  effects  model  (5.5)  has  been  considered 
previously  by  a number  of  authors.  Littell  et  al.  (1996)  used  this  model  to  analyze 
data  from  an  eight-center  clinical  trial  in  which  two  drugs  were  compared  with  respect 
to a binary response. Likewise, Booth and Hobert (1999) analyzed data from fourteen
retrospective  studies  on  the  association  between  smoking  and  lung  cancer.  Such  data 
have  the  form  of  Table  5.1  with  center  numbers  replaced  by  study  numbers.  As 
noted  by  Booth  and  Hobert  (1999),  model  (5.5)  has  three  random  effects,  which, 
under  the  assumption  of  normality  for  the  random  effects,  means  that  three  integrals 
must  be  approximated  for  each  center.  In  fact,  a simple  reparameterization  of  the 
random  effects  in  (5.5)  can  reduce  the  number  of  intractable  integrals  to  two  for  each 
center.  This  simplification  was  used  by  Agresti  and  Hartzel  (1999)  for  analyzing 
binary  multi-center  data. 




Reparameterization 

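Let the random effects covariance matrix $\Sigma$ for model (5.5) have the unstructured
form

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{pmatrix}. \tag{5.6}$$

Model (5.5) can be simplified in the following manner

$$\begin{aligned}
\eta_{ijr} &= \alpha_r + \beta x_j + u_i + v_{ij} \\
&= \alpha_r + \beta x_j + u_i + \frac{v_{i1} + v_{i2}}{2} + (-1)^{I[j=1]}\,\frac{v_{i2} - v_{i1}}{2} \\
&= \alpha_r + \beta x_j + u_i^* + (-1)^{I[j=1]}\, v_i^*,
\end{aligned} \tag{5.7}$$

where $u_i^* = u_i + \frac{v_{i1} + v_{i2}}{2}$, $v_i^* = \frac{v_{i2} - v_{i1}}{2}$, and $I[j = 1]$ is an indicator function which is
one when $j = 1$ and zero otherwise. In model (5.7) there are only two random effects
$\mathbf{u}_i' = (u_i^*, v_i^*)$ and only two integrals to approximate. The new covariance matrix $\Sigma^*$
is of the form

$$\Sigma^* = \begin{pmatrix} \sigma_1^{2*} & \sigma_{12}^* \\ \sigma_{12}^* & \sigma_2^{2*} \end{pmatrix}. \tag{5.8}$$

It is informative to consider the relationship between the elements in (5.8) and
the elements in (5.6). Using the definitions of $u_i^*$ and $v_i^*$, one can show that

$$\begin{aligned}
\sigma_1^{2*} &= \sigma_1^2 + \tfrac{1}{4}(\sigma_2^2 + \sigma_3^2) + \sigma_{12} + \sigma_{13} + \frac{\sigma_{23}}{2}, \\
\sigma_2^{2*} &= \tfrac{1}{4}(\sigma_2^2 + \sigma_3^2) - \frac{\sigma_{23}}{2}, \\
\sigma_{12}^* &= \tfrac{1}{2}(\sigma_{13} - \sigma_{12}) + \tfrac{1}{4}(\sigma_3^2 - \sigma_2^2).
\end{aligned}$$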
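These relationships can be checked numerically by writing the reparameterization as a linear map. The Python sketch below assumes $u_i^* = u_i + (v_{i1} + v_{i2})/2$ and $v_i^* = (v_{i2} - v_{i1})/2$, the sign convention under which the $(-1)^{I[j=1]}$ coding reproduces $v_{ij}$ exactly, and computes $A \Sigma A'$ with plain loops. The function name and the example matrix are hypothetical.

```python
def reparameterized_covariance(sigma):
    """Covariance of (u*, v*) implied by the 3x3 covariance of (u, v1, v2)."""
    # Rows of A express u* = u + (v1 + v2)/2 and v* = (v2 - v1)/2.
    a = [[1.0, 0.5, 0.5],
         [0.0, -0.5, 0.5]]
    # First form A * Sigma (2x3), then multiply by A' to get a 2x2 matrix.
    a_sigma = [[sum(a[r][k] * sigma[k][c] for k in range(3)) for c in range(3)]
               for r in range(2)]
    return [[sum(a_sigma[r][k] * a[c][k] for k in range(3)) for c in range(2)]
            for r in range(2)]
```

For instance, with equal interaction variances and equal covariances with the center effect, `reparameterized_covariance([[1.0, 0.2, 0.2], [0.2, 0.5, 0.1], [0.2, 0.1, 0.5]])` has a zero off-diagonal element, so the two transformed effects are uncorrelated.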


If, for example, one assumed in model (5.5) that the interaction components $v_{i1}$ and
$v_{i2}$ were normally distributed and had common variances and covariances (i.e., $\sigma_2^2 = \sigma_3^2$
and $\sigma_{12} = \sigma_{13}$), it would translate into assuming that $u_i^*$ and $v_i^*$ are independent in
model (5.7) (i.e., $\sigma_{12}^* = 0$). Since model (5.7) requires fewer integrals to be approximated,
it is advantageous to use that representation. Thus, for the remainder of the
chapter when referring to the heterogeneous association model, we will be referring
to model (5.7). Regardless of the model fit, if one allows both (5.6) and (5.8) to be
unstructured, both models will yield the same estimates and log-likelihood values.

Estimation

Estimation  of  the  heterogeneous  association  model  (5.7),  under  the  assumption 
of  normality  for  the  random  effects,  can  be  carried  out  using  any  of  the  algorithms 
discussed in Chapter 3. Recall that the general multinomial random effects model
was of the form $\boldsymbol{\eta}_{ij} = Z_{ij}\boldsymbol{\beta} + W_{ij}\mathbf{u}_i$. For model (5.7), $Z_{ij}$ consists of the appropriate
design matrix for the chosen response function, and a column consisting of a 1.0 or
0.0 depending on the status of the treatment. The corresponding parameter vector
is $\boldsymbol{\beta}' = (\alpha_1, \ldots, \alpha_q, \beta)$, and the random effects vector is $\mathbf{u}_i' = (u_i^*, v_i^*)$. The random
effects design matrix $W_{ij}$ has a column of ones for the random center effect and a
column consisting of -1.0 or 1.0 depending on the status of the treatment.
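The two design matrices described above can be written out explicitly. The Python sketch below is a minimal illustration that assumes a link whose threshold part of the design is an identity block (as for the cumulative logit), that the treatment column is 1.0 for treatment 2 and 0.0 for treatment 1, and that the random interaction column is -1.0 for treatment 1 and 1.0 for treatment 2. The function names are hypothetical.

```python
def design_rows(q, treatment):
    """Return (Z_ij, W_ij) for treatment j in {1, 2} with q thresholds."""
    if treatment not in (1, 2):
        raise ValueError("treatment must be 1 or 2")
    x = 1.0 if treatment == 2 else 0.0      # assumed coding of the covariate
    sign = -1.0 if treatment == 1 else 1.0  # (-1)^{I[j = 1]}
    # One row per threshold: identity block for the alphas, then the covariate.
    z = [[1.0 if c == r else 0.0 for c in range(q)] + [x] for r in range(q)]
    # Column of ones for the center effect, signed column for the interaction.
    w = [[1.0, sign] for _ in range(q)]
    return z, w


def linear_predictor(z, w, fixed, random_effects):
    """eta_ij = Z_ij * beta + W_ij * u_i, one entry per threshold."""
    return [sum(zc * b for zc, b in zip(z_row, fixed))
            + sum(wc * u for wc, u in zip(w_row, random_effects))
            for z_row, w_row in zip(z, w)]
```

With `fixed = [alpha_1, ..., alpha_q, beta]` and `random_effects = [u_star, v_star]`, the returned list contains the q linear predictors for the given center and treatment.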

In addition to estimating the average association $\beta$ across all centers and its variability
$\sigma_2^{2*}$, one can also obtain predictions for the value of the association parameter
for each center. This is analogous to the predicted center effects discussed in the
previous section for model (5.4). To accomplish this, one would predict the value of
$v_i^*$ in model (5.7) for each center, and then use the estimated value of $\beta$ to determine
the association for each center. If one were using a cumulative logit link, for example,
one could exponentiate these predictions to obtain the center-specific cumulative
odds ratios.
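As a small worked illustration of this step, suppose the covariate is coded 1 for treatment 2 and 0 for treatment 1 (an assumption about the coding). Under model (5.7) the difference between the two treatments' linear predictors in center $i$ is then $\beta + 2 v_i^*$, since the interaction enters with opposite signs for the two treatments. The Python sketch below exponentiates that quantity; the function name and inputs are hypothetical.

```python
import math


def center_odds_ratios(beta_hat, v_star_predictions):
    """Center-specific cumulative odds ratios exp(beta + 2 * v_i*)."""
    # The factor 2 arises because v_i* enters as -v_i* for treatment 1
    # and +v_i* for treatment 2 under the assumed coding.
    return [math.exp(beta_hat + 2.0 * v) for v in v_star_predictions]
```

A center whose predicted interaction effect is zero receives the average odds ratio `exp(beta_hat)`, and centers with positive predictions receive larger values.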

For data arising from a multi-center clinical trial, the number of centers is usually
small. Thus it is questionable whether valid estimates can be obtained for the covariance
matrix $\Sigma^*$. Indeed for Table 5.1 the estimates would be based on only eight
observations (centers). In Section 5.4 we report the results of a simulation study in
which  we  examined  the  performance  of  the  heterogeneous  association  model  under  the 
assumption  of  normality  for  the  random  effects.  As  an  alternative  to  normality,  one 
could  also  estimate  model  (5.7)  using  a nonparametric  approach.  In  the  next  section 
we  show  how  one  can  extend  the  NPML  model  of  Chapter  4 to  fit  the  heterogeneous 
association  model. 

5.3.3  Heterogeneous  Association:  NPML  Estimation 

In  Section  4.2  we  outlined  the  NPML  EM  algorithm  for  fitting  multinomial  random 
effects  models  with  shifted  thresholds.  With  only  slight  modifications,  this  algorithm 
can  also  be  used  to  fit  model  (5.7),  or,  more  generally,  models  with  multiple  random 
effects.  There  have  been  relatively  few  uses  of  the  NPML  approach  with  multiple 
random  effects  in  the  literature.  Davies  and  Pickles  (1987)  utilized  a bivariate  discrete 
random  effects  distribution  for  studying  shopping  travel.  For  their  particular  model, 
the  NPML  estimate  of  the  mixing  distribution  had  a surprisingly  large  estimated 
support  size  of  eighteen.  Davies  (1993)  applied  the  NPML  approach  to  the  modeling 
of  residual  heterogeneity  in  recurrent  behavior  data.  As  an  example,  he  considered 
a depression  dataset  in  which  the  depression  status  of  subjects  (depressed  or  not 
depressed)  was  recorded  at  each  of  four  consecutive  interviews.  Using  a first-order 
Markov  chain  model,  Davies  (1993)  assumed  that  the  transition  matrix  consisted  of 
two  subject-specific  transition  probabilities  which  he  modeled  with  a bivariate  discrete 
distribution.  For  his  application,  the  NPML  estimate  required  only  three  mass  points. 
Finally,  Aitkin  (1999)  discussed  how  one  could  fit  random  coefficient  models  using 
the  NPML  approach.  He  utilized  this  approach  to  analyze  a binary  multi-center 
clinical  trial  dataset  allowing  for  heterogeneous  associations.  Though  the  estimated 
discrete  bivariate  distribution  had  only  three  mass  points,  Aitkin  (1999)  noted  that, 
in  general,  more  mass  points  were  needed  for  the  bivariate  case.  From  our  experience, 
only  three  to  five  mass  points  are  typically  needed  for  the  bivariate  case  as  well. 




For a model such as (5.7), the NPML approach assumes that the joint distribution
of $(u_i^*, v_i^*)$ is a discrete distribution with mass points $(m_{k1}, m_{k2})$ and masses $p_k$,
$k = 1, \ldots, K$, where $K$ is the unknown support size. For the more general model, the
likelihood can be written in the form

$$L(\boldsymbol{\beta}, \mathbf{p}, \mathbf{m}) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k\, f(\mathbf{y}_i \mid \boldsymbol{\beta}, \mathbf{m}_k), \tag{5.9}$$

where $f(\mathbf{y}_i \mid \boldsymbol{\beta}, \mathbf{m}_k) = \prod_{j} f(\mathbf{y}_{ij} \mid \boldsymbol{\beta}, \mathbf{m}_k)$, $\mathbf{m}_k = (m_{k1}, \ldots, m_{km^*})$, and $m^*$ is the
dimension of the random effects. Likelihood (5.9) is a generalization of likelihood
(4.2) given in Section 4.2.
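The structure of likelihood (5.9) is that of a finite mixture, which the following Python sketch makes explicit. It assumes the conditional log-densities $\log f(\mathbf{y}_i \mid \boldsymbol{\beta}, \mathbf{m}_k)$ have already been evaluated and stored in a nested list. The second function computes the posterior mass-point weights that a standard EM algorithm for finite mixtures would use in its E-step; whether the algorithm of Section 4.2 is organized in exactly this way is an assumption. All names are hypothetical.

```python
import math


def npml_log_likelihood(log_f, masses):
    """Log of (5.9); log_f[i][k] = log f(y_i | beta, m_k), masses[k] = p_k."""
    total = 0.0
    for row in log_f:
        terms = [math.log(p) + lf for p, lf in zip(masses, row) if p > 0]
        top = max(terms)  # log-sum-exp guards against underflow
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


def posterior_weights(log_f, masses):
    """For each center, the posterior probability of each mass point."""
    weights = []
    for row in log_f:
        terms = [math.log(p) + lf if p > 0 else float("-inf")
                 for p, lf in zip(masses, row)]
        top = max(terms)
        unnormalized = [math.exp(t - top) for t in terms]
        norm = sum(unnormalized)
        weights.append([w / norm for w in unnormalized])
    return weights
```

Working on the log scale matters in practice because the conditional density of a whole center's data can be extremely small.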

Maximization of the log of (5.9) can be accomplished using the NPML EM algorithm
defined in Section 4.2. For model (5.7) we now have $K$ pairs of mass points
$(m_{k1}, m_{k2})$ to estimate. We use a similar approach for incorporating the pairs of mass
points into the design matrix $Z_{ij}$ as was used to incorporate the single mass point $m_k$
for the shifted threshold model. Specifically, in addition to the $K$-level factor used
in the shifted threshold model, we also include a second factor that is obtained by
interacting the first factor with the covariate $x_j$. Recall that the $K$-level factor is
coded as a set of $K$ dummy variables. Thus the covariate $x_j$ is multiplied by each of
the $K$ dummy variables. To avoid identifiability issues with the thresholds, one must
exclude two of the design columns related to the mass point factors.
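The coding of the two mass point factors can be sketched in a few lines of Python. The function below builds, for one observation assigned to mass point `k_index`, the $K$ dummy variables followed by their products with the covariate. It deliberately returns all $2K$ columns and leaves the removal of two columns for identifiability to the caller; the function name and argument layout are hypothetical.

```python
def mass_point_columns(k_index, n_mass_points, x):
    """Dummy columns for mass point k plus their products with covariate x."""
    if not 0 <= k_index < n_mass_points:
        raise ValueError("k_index out of range")
    # K-level factor coded as K dummy variables.
    dummies = [1.0 if k == k_index else 0.0 for k in range(n_mass_points)]
    # Second factor: each dummy multiplied by the covariate.
    interactions = [d * x for d in dummies]
    return dummies + interactions
```

For `K = 3`, an observation assigned to the second mass point with covariate value 1 yields `[0, 1, 0, 0, 1, 0]`.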

In a similar manner, one can use the NPML approach for fitting models, such as
the continuation-ratio logit or baseline-category logit, that allow for varying thresholds.
Recall that in these models, one specifies separate regression parameters for
each logit. To allow for varying thresholds, one simply incorporates a $K$-level factor
for each logit. To identify all parameters, one must either suppress the intercept
parameter for each logit, or remove one of the dummy variables for the $K$-level mass
point factor. For the toxicity and life satisfaction datasets considered in Section 3.6,
we used a parametric random effects approach to fit a continuation-ratio logit model
and a baseline-category logit model that allowed for varying thresholds. By assuming
a bivariate discrete distribution for the random effects, we can now use a nonparametric
random effects approach to fit these same models. To this end, we briefly
reconsider the toxicity and life satisfaction datasets originally analyzed in Section
3.6.

Developmental  toxicity  data 

In  Section  3.6.2  we  analyzed  data  from  a toxicity  study  (Table  3.3),  in  which 
pregnant  mice  were  administered  one  of  four  dosages  (0,  0.75,  1.50,  3.00  g/kg)  of 
ethylene  glycol.  After  exposure,  their  fetuses  were  then  examined  for  defects,  and 
classified  as  either  Dead/Resorption,  Malformed,  or  Normal.  Results  from  Table  3.9 
suggested  that  a continuation-ratio  logit  model  allowing  separate  dosage  parameters 
for  the  two  logits,  which  model  the  probability  of  a dead/resorbed  fetus  and  the 
conditional  probability  of  a malformed  fetus  given  the  fetus  was  alive,  as  well  as 
random threshold parameters was adequate. Thus, the linear predictor for the $i$th
litter and the $r$th logit is

$$\eta_{ir} = \alpha_r + \beta_{DO_r} x_i + u_{ir}, \tag{5.10}$$

where $\mathbf{u}_i' = (u_{i1}, u_{i2})$ is a bivariate random effect, and the remaining parameters and
covariate are defined as in Section 3.6.2.

Previously we assumed that $\mathbf{u}_i$ followed a multivariate normal distribution with
mean $\mathbf{0}$ and covariance matrix $\Sigma$. We now relax this assumption and assume that $\mathbf{u}_i$
follows a bivariate discrete distribution with mass points $(m_{k1}, m_{k2})$ and masses $p_k$,
$k = 1, \ldots, K$, where $K$ is the unknown support size. Table 5.2 contains the NPML
estimate for model (5.10) along with the results from the corresponding parametric
model which allowed for correlation between thresholds. Assuming a discrete joint
distribution for $u_{i1}$ and $u_{i2}$ automatically incorporates correlation between the random
effects. In fact, we were unable to fit the NPML model which assumes that the random
effects have zero correlation.

The  estimated  discrete  distribution  contained  three  points  with  locations  (-3.74, 
-3.31),  (-19.67,  -5.31),  and  (-3.87,  -7.76),  and  masses  0.44,  0.22,  and  0.33,  respectively. 
Note the extremely large negative first coordinate (-19.67) for the second mass point, which
pertains to the random effect for the first logit. Observing this coordinate during the
estimation algorithm revealed that it was decreasing towards $-\infty$ with little to no
change in the parameter and log-likelihood values. A possible reason for this behavior
is that 82% of the litters contained no dead or resorbed fetuses, with the maximum
number of dead or resorbed fetuses being two in the remaining litters. As a result of
this  small  coordinate,  the  estimate  of  the  standard  deviation  for  the  first  threshold  and 
the  correlation  estimate  are  suspect,  as  are  the  estimates  for  the  threshold  parameters 
as  they  are  functions  of  the  mass  points.  Examining  the  dosage  parameters  we  see 
that  similar  results  are  found  for  both  the  parametric  and  nonparametric  approaches, 
though  the  NPML  approach  does  have  a larger  estimate  and  standard  error  for  the 
dosage  parameter  in  the  second  logit.  Conclusions  based  on  the  dosage  parameters, 
however,  would  remain  the  same  for  both  approaches. 
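The dependence of the threshold, standard deviation, and correlation estimates on the mass points can be seen by computing the moments of the estimated discrete distribution directly. The Python sketch below does this for the three mass points and masses reported above. Because the reported masses are rounded and sum to 0.99, the sketch renormalizes them, which is an assumption, so the results should match the NPML column of Table 5.2 only approximately. The function name is hypothetical.

```python
import math


def discrete_moments(points, masses):
    """Means, standard deviations and correlation of a bivariate discrete law."""
    total = sum(masses)
    probs = [p / total for p in masses]  # renormalize the rounded masses
    mean = [sum(p * pt[d] for p, pt in zip(probs, points)) for d in range(2)]
    var = [sum(p * (pt[d] - mean[d]) ** 2 for p, pt in zip(probs, points))
           for d in range(2)]
    cov = sum(p * (pt[0] - mean[0]) * (pt[1] - mean[1])
              for p, pt in zip(probs, points))
    sd = [math.sqrt(v) for v in var]
    return mean, sd, cov / (sd[0] * sd[1])


# Mass points and masses as reported in the text for the toxicity data.
points = [(-3.74, -3.31), (-19.67, -5.31), (-3.87, -7.76)]
masses = [0.44, 0.22, 0.33]
mean, sd, rho = discrete_moments(points, masses)
```

The computed means fall near -7.3 and -5.2 and the standard deviations near 6.6 and 1.9, which shows how the single mass point at -19.67 drives both the first threshold estimate and the large first standard deviation.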

Life  satisfaction  data 

In  a similar  manner,  we  now  can  nonparametrically  analyze  the  item  response  data 
(Table  3.4)  originally  analyzed  parametrically  in  Section  3.6.3.  Recall  that  subjects 
were  asked  to  rate  their  degree  of  life  satisfaction  with  their  family,  hobbies,  and 
residence  using  a three-point  scale  (1  = Low,  2 = Medium,  3 = High).  In  Section 
3.6.3  we  utilized  a baseline-category  logit  model  that  allowed  for  correlated  random 
threshold  effects  in  order  to  estimate  the  item  parameters.  This  model  was  of  the 
form 


$$\eta_{ijr} = \beta_{Fr}\, x_{ij1} + \beta_{Hr}\, x_{ij2} + \beta_{Rr} + u_{ir}, \tag{5.11}$$




Table 5.2: Parameter estimates for model (5.10) from both the NPML algorithm and
the adaptive Gauss-Hermite algorithm (AGH) with 18 quadrature points using the
continuation-ratio logit link.

Parameter            NPML               AGH
$\alpha_1$           -7.284             -4.198
$\alpha_2$           -5.242             -4.356
$\beta_{DO_1}$        0.092 (.211)       0.083 (.217)
$\beta_{DO_2}$        2.320 (.276)       1.780 (.219)
$\sigma_1$            6.567              0.559
$\sigma_2$            1.948              1.587
$\rho$                0.027              0.080
LL                 -459.532           -464.733

where $u_{ir}$ is a threshold-specific random effect, and the remaining parameters and
covariate are defined as in Section 3.6.3.

Using  an  assumption  of  normality  for  the  random  threshold  effects  that  allowed 
for  correlation,  we  estimated  the  item  parameters  for  each  of  the  two  logits.  Alter- 
natively one  could  assume  that  the  threshold  random  effects  follow  an  unspecified 
discrete  bivariate  distribution,  and  use  the  NPML  algorithm  to  estimate  the  item 
parameters.  Table  5.3  contains  the  results  for  this  model,  as  well  as  the  correspond- 
ing parametric  results  using  the  adaptive  Gauss-Hermite  algorithm  (AGH)  with  15 
quadrature  points.  As  we  did  with  the  adaptive  algorithm,  we  modified  the  NPML 
algorithm  to  exploit  the  fact  that  only  27  unique  response  profiles  were  possible  for 
this  experiment. 

The NPML estimate of the discrete distribution had a support size of four.
The mass points and masses were given by {(4.39, 2.74), (1.55, 0.64), (0.08, 1.11),
(-2.71, -1.33)} and {.22, .40, .35, .03}, respectively. The results for both models
were very similar, including the estimates for the standard deviations of the random
threshold effects. One might consider using the NPML approach over the adaptive
approach purely for computational reasons, as the latter requires numerical approximation
of two integrals. However, the computational effort for the NPML approach
with multiple random effects is relatively high as well. Matrices in the algorithm
quickly become large as an additional $K$ dummy variables are needed for each additional
random effect. Use of the acceleration techniques discussed in Section 4.2 is
highly recommended when the number of random effects is increased.

Table 5.3: Parameter estimates for model (5.11) from both the NPML algorithm and
the adaptive Gauss-Hermite algorithm (AGH) with 15 quadrature points using the
baseline-category logit link.

Parameter            NPML               AGH
$\beta_{F1}$          1.466 (.156)       1.384 (.169)
$\beta_{H1}$          0.997 (.122)       0.933 (.119)
$\beta_{R1}$          1.209 (.119)       1.144 (.115)
$\beta_{F2}$          3.300 (.160)       3.264 (.161)
$\beta_{H2}$          1.732 (.120)       1.709 (.118)
$\beta_{R2}$          1.530 (.122)       1.509 (.118)
$\sigma_1$            0.898              0.832
$\sigma_2$            1.723              1.626
$\rho$                0.862              0.617
LL                -3734.85           -3736.93

5.4  Simulation  Study 


In the previous section we showed how one could use the parametric approaches
of Chapter 3 to analyze multi-center clinical trial data. If one is willing to assume
that the centers constitute a random sample from some population of centers, one can
estimate the variability among the centers as well as the variability in the association
parameter across the centers. As the estimation algorithms in Chapter 3 will provide
estimates for the heterogeneous association model for the majority of datasets,
it is easy to blindly apply such an approach and interpret the resulting estimates.
However, consider the data in Table 5.1. In this multi-center clinical trial, data were
collected at only eight centers. Suppose we actually observed the random center and
interaction effects for each center. Thus we would have a sample of eight pairs of
realizations that we would assume, under the parametric heterogeneous association
model, arose from a bivariate normal distribution with zero mean and unknown covariance
matrix. The standard multivariate sample variance formula could then be
used to estimate the unknown covariance matrix. Even so, we would probably have
little faith in this estimate as it is based on only eight observations. When fitting
the  heterogeneous  association  model,  we  do  not  have  the  realizations  of  the  random 
effects  and  must  use  the  information  in  the  data  to  learn  about  their  behavior.  Thus, 
it  is  questionable  how  well  we  can  describe  their  distribution  based  on  such  small 
numbers  of  centers. 

In  order  to  assess  the  performance  of  the  adaptive  algorithm  for  fitting  the  para- 
metric heterogeneous  association  model  to  multi-center  clinical  trial  data,  we  per- 
formed a number  of  simulations.  For  the  heterogeneous  association  model  (5.7), 
there are a large number of factors that could be examined in a simulation study,
including the number of centers, the sample size within centers, the distribution
of the random effects, and the structure of the covariance matrix for the random effects.
To study all possible factors would be impractical in a single simulation study.
In addition, fitting model (5.7) requires the approximation of two integrals for each
center. Thus each simulation entails a large amount of computation and time. We
also found for datasets such as Table 5.1 with small numbers of clusters, that a large
simulation size is required to reduce the Monte Carlo error to a reasonable level.

In light of these remarks, we performed a series of simulations using the heterogeneous
random effects model (5.7)

$$\eta_{ijr} = \alpha_r + \beta x_j + u_i^* + (-1)^{I[j=1]}\, v_i^*, \tag{5.12}$$
$$r = 1, \ldots, q = R - 1, \quad i = 1, \ldots, n, \quad j = 1, 2,$$

where we set the number of responses $R = 3$, $\alpha_1 = -1$, $\alpha_2 = -5$, and $\beta = .75$.
We also let $x_j = 1$ if $j = 2$ and zero otherwise. We considered two sample size
structures for the data. In the first sample size structure, we used a small number
of centers (8), but a large number of observations for each treatment within center
(30). For the second sample size structure, we increased the number of centers to
20, but reduced the number of observations per treatment to 12. The next factor
that we set was the covariance structure for the random effects. For simplicity we
set the covariance term in $\Sigma$ to be zero and looked at two sets of values for the
variance components $(\sigma_1^2, \sigma_2^2)$, where the first variance component corresponds to the
random center effect and the second to the random interaction effect. The values
considered were (0.3, 1.0) and (1.0, 0.3), which examined one situation where the
random interaction component was large and the other where the random center
component was large. We also looked at a third case where a random center effect
existed, but the random interaction did not. For this case $(\sigma_1^2, \sigma_2^2) = (0.3, 0)$. Finally,
we  considered  two  cases  for  the  distribution  of  the  random  effects.  In  the  first  case 
we  assumed  a bivariate  normal  distribution.  For  the  second  case  we  used  a mixture 
of  two  bivariate  normal  distributions  to  produce  a non-normal  distribution  with  the 
required  mean  and  covariance  structures.  We  discuss  how  this  was  done  below. 
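A single simulated dataset under the bivariate normal case can be generated along the following lines. This Python sketch assumes a cumulative logit link with increasing thresholds and independent normal random effects; the link function, the function names, and the threshold values in the usage note are illustrative assumptions rather than the settings of the study.

```python
import math
import random


def simulate_center(alphas, beta, u_star, v_star, n_per_trt, rng):
    """Category counts for treatments j = 1, 2 in one center."""
    tables = []
    for j in (1, 2):
        x = 1.0 if j == 2 else 0.0
        sign = -1.0 if j == 1 else 1.0  # (-1)^{I[j = 1]}
        cumulative = [1.0 / (1.0 + math.exp(-(a + beta * x + u_star + sign * v_star)))
                      for a in alphas]
        counts = [0] * (len(alphas) + 1)
        for _ in range(n_per_trt):
            draw = rng.random()
            # Category index = number of cumulative probabilities not above the draw.
            category = sum(1 for c in cumulative if draw >= c)
            counts[category] += 1
        tables.append(counts)
    return tables


def simulate_trial(n_centers, n_per_trt, alphas, beta, var_u, var_v, seed=0):
    """One simulated multi-center dataset with independent normal effects."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_centers):
        u_star = rng.gauss(0.0, math.sqrt(var_u))
        v_star = rng.gauss(0.0, math.sqrt(var_v))
        data.append(simulate_center(alphas, beta, u_star, v_star, n_per_trt, rng))
    return data
```

A call such as `simulate_trial(8, 30, (-1.0, 0.5), 0.75, 0.3, 1.0)` mimics the first sample size structure with eight centers and 30 observations per treatment, using illustrative thresholds.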

Simulating  data  from  multivariate  non-normal  distributions  is  difficult  as  most  do 
not  have  the  flexible  parameterizations  for  the  covariance  matrix  as  the  multivariate 


220 


normal  distribution  has.  An  alternative  approach  for  simulating  non-normal  mul- 
tivariate distributions  with  known  covariance  structures  is  to  use  a mixture  of  two 
multivariate  normal  distributions.  In  this  approach,  one  samples  from  the  first  mul- 
tivariate normal  with  probability  p and  the  second  with  probability  1 — p.  Suppose 
the  random  variable  x is  distributed  as 


/(x)  =pMVN(n1,  E0  + (1  - p)  MWV(/z2.E2),  (5.13) 

where  MVN  denotes  the  multivariate  normal  distribution  and  0 < p < 1.  Johnson 
(1987,  p.  57)  showed  that  the  mean  and  variance  of  x distributed  as  (5.13)  are 

$$E(x) = p\,\mu_1 + (1 - p)\,\mu_2,$$

$$\mathrm{Cov}(x) = p\,\Sigma_1 + (1 - p)\,\Sigma_2 + p(1 - p)(\mu_1 - \mu_2)(\mu_1 - \mu_2)'.$$


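Using these formulas we chose values for $\Sigma_1$, $\Sigma_2$, $\mu_1$, $\mu_2$, and $p$ that would produce
the covariance structures defined above. The values chosen were $p = 0.5$,

$$\mu_1 = \begin{pmatrix} 1/6 \\ 1/6 \end{pmatrix}, \qquad
\Sigma_1 = \begin{pmatrix} 23/18 & -1/6 \\ -1/6 & 2/45 \end{pmatrix}, \qquad
\Sigma_2 = \begin{pmatrix} 2/3 & -1/9 \\ -1/9 & 1/2 \end{pmatrix}.$$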


Figure  5.1  displays  a contour  plot  of  the  mixture  distribution  produced  by  the  choices 
given  above. 
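
As a minimal sketch of the moment formulas of Johnson (1987) quoted above, the following Python fragment computes the mean and covariance of a two-component bivariate normal mixture and compares one covariance entry with a simulated sample. It is not the code used for the dissertation. The function names and the parameter values are hypothetical and were chosen only for illustration; they are not the values used in the simulation study.

```python
import random

def mixture_moments(p, mu1, S1, mu2, S2):
    """Mean and covariance of p*MVN(mu1, S1) + (1-p)*MVN(mu2, S2) in two dimensions."""
    mean = [p * a + (1 - p) * b for a, b in zip(mu1, mu2)]
    d = [a - b for a, b in zip(mu1, mu2)]
    cov = [[p * S1[r][c] + (1 - p) * S2[r][c] + p * (1 - p) * d[r] * d[c]
            for c in range(2)] for r in range(2)]
    return mean, cov

def draw_bvn(mu, S, rng):
    """One draw from a bivariate normal, using the Cholesky factor of S."""
    q11 = S[0][0] ** 0.5
    q21 = S[1][0] / q11
    q22 = (S[1][1] - q21 ** 2) ** 0.5
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return [mu[0] + q11 * z1, mu[1] + q21 * z1 + q22 * z2]

def draw_mixture(p, mu1, S1, mu2, S2, rng):
    """Pick the first component with probability p, otherwise the second."""
    if rng.random() < p:
        return draw_bvn(mu1, S1, rng)
    return draw_bvn(mu2, S2, rng)

# Hypothetical illustrative values (assumptions, not taken from the text)
p = 0.5
mu1, mu2 = [1.0, -0.5], [-1.0, 0.5]
S1 = [[1.0, 0.3], [0.3, 0.5]]
S2 = [[0.5, -0.2], [-0.2, 1.0]]

mean, cov = mixture_moments(p, mu1, S1, mu2, S2)

# Empirical check of the off-diagonal covariance term
rng = random.Random(1)
draws = [draw_mixture(p, mu1, S1, mu2, S2, rng) for _ in range(20000)]
m0 = sum(d[0] for d in draws) / len(draws)
m1 = sum(d[1] for d in draws) / len(draws)
emp_cov01 = sum((d[0] - m0) * (d[1] - m1) for d in draws) / (len(draws) - 1)
print(mean, cov, emp_cov01)
```

The last term of the covariance formula shows why the component means matter: separating the two means inflates the variances and shifts the covariance, so the component covariance matrices have to be chosen to offset it.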

In summary, the factors that were selected for the simulation runs are given in
the table below.

Sampling Structure          Covariance Structure     Random Effects Distribution
(# Centers, # Obs/Trt)                               (bivariate)

(8, 30)                     (.3, 1.0)                Normal
(20, 12)                    (1.0, .3)                Mixture
                            (.3, 0)

Since  the  covariance  structure  without  the  random  interaction  was  only  run  with  a 
normal  center  effect,  there  were  a total  of  ten  simulations  run.  Pilot  simulation  studies 
revealed  that  the  Monte  Carlo  error  associated  with  the  association  parameter  was 
quite  large  for  the  eight  center  case.  Thus  simulation  sizes  of  1,000  were  used  for 
the  runs  with  eight  centers,  and  sizes  of  750  were  used  for  the  20  center  cases.  This 
reduced  the  Monte  Carlo  error  in  the  parameter  estimates  to  approximately  0.01  for 
both  center  sizes. 
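
The count of ten runs can be reproduced by enumerating the factor combinations. The short Python sketch below does this under the reading given in the text, namely that the no-interaction covariance structure was paired with the normal distribution only. The variable names are hypothetical.

```python
from itertools import product

sampling = [(8, 30), (20, 12)]            # (# centers, # observations per treatment)
covariance = [(0.3, 1.0), (1.0, 0.3)]     # (center, interaction) variance components
distribution = ["normal", "mixture"]

# Full crossing of the two-component covariance structures with both distributions
runs = list(product(sampling, covariance, distribution))

# The structure without a random interaction was run with a normal center effect only
runs += [(s, (0.3, 0.0), "normal") for s in sampling]

# Number of simulated datasets per run, as stated in the text
n_reps = {8: 1000, 20: 750}

for (centers, obs), cov, dist in runs:
    print(centers, obs, cov, dist, n_reps[centers])
```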

Figure 5.1: Contour plot of the bivariate normal mixture distribution used for Table 5.5.

Tables 5.4 and 5.5 contain the results for eight of the ten simulations performed.
Missing are the two simulations in which we simulated data from a univariate normal
distribution such that the random interaction did not exist. For both of these
simulations (i.e., cluster sizes of 8 and 20), the adaptive quadrature algorithm had severe
convergence problems. In more than 85% of the simulated datasets the algorithm
failed to converge. It was obvious that the cause of the convergence problem was the
near-zero estimate for the interaction variance component. In hope of alleviating this
problem, we modified the adaptive algorithm so that the elements of the Cholesky
square root of the covariance matrix were estimated instead of the actual variance
components. As noted before, this approach often performs better when the variance
components are small. For the bivariate random effects case, the Cholesky square
root $Q$ is given by

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
= QQ' = \begin{pmatrix} q_{11} & 0 \\ q_{21} & q_{22} \end{pmatrix}
\begin{pmatrix} q_{11} & q_{21} \\ 0 & q_{22} \end{pmatrix}.$$

Thus $\sigma_1^2 = q_{11}^2$, $\sigma_{12} = q_{11} q_{21}$, and $\sigma_2^2 = q_{21}^2 + q_{22}^2$. Unfortunately this did not remedy
the convergence problems as the estimate of $q_{22}$ was very near zero as well. We do
not feel this is a problem of the algorithm itself. Indeed, if one were questioning
whether to fit the heterogeneous random effects model, one can use the failure of the
algorithm as an indication that it is not needed. By observing the parameter estimates
at each iteration of the algorithm, one can easily see which variance component is
near zero. For those simulations that did converge, the estimate for the interaction
component was very near zero. In such instances, evaluation of the observed
information became unstable and often led to negative variance estimates for the
parameters. Due to the unstable estimates and poor convergence rates, we excluded
any tables summarizing these results.
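
The Cholesky parameterization described above can be sketched in a few lines of Python. This is only an illustration of the mapping between $(q_{11}, q_{21}, q_{22})$ and $(\sigma_1^2, \sigma_{12}, \sigma_2^2)$; it is not the Ox code used for the simulations, and the function names are hypothetical.

```python
def sigma_from_cholesky(q11, q21, q22):
    """Map Cholesky elements to the 2 x 2 covariance matrix Sigma = QQ'."""
    s11 = q11 ** 2
    s12 = q11 * q21
    s22 = q21 ** 2 + q22 ** 2
    return [[s11, s12], [s12, s22]]

def cholesky_from_sigma(s11, s12, s22):
    """Inverse map; assumes s11 > 0 and a nonnegative-definite matrix."""
    q11 = s11 ** 0.5
    q21 = s12 / q11
    q22 = (s22 - q21 ** 2) ** 0.5
    return q11, q21, q22

# Any real values of the q's give a valid covariance matrix, because
# det(Sigma) = (q11 * q22)^2 >= 0 and both diagonal elements are nonnegative.
boundary = sigma_from_cholesky(1.0, 0.5, 0.0)    # q22 = 0 puts Sigma on the boundary
det = boundary[0][0] * boundary[1][1] - boundary[0][1] ** 2

q = cholesky_from_sigma(0.3, 0.0, 1.0)           # one of the settings used in the text
print(boundary, det, q)
```

An optimizer working with the $q$'s therefore needs no positivity constraints, although, as noted above, this does not help when the data themselves push $q_{22}$ toward zero.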

Table 5.4 contains the results for the simulations in which the random effects
were generated from a bivariate normal distribution. Reported are the estimated
biases in the parameter estimates, along with the average standard error estimate
of $\beta$ calculated from the observed information matrix and the corresponding Monte
Carlo estimate. Examining the bias in $\beta$, the parameter of interest, we see that the
degree of bias is minor. The largest absolute estimated bias (-0.018) occurred with a
cluster size of 20 and an interaction variance component of 1.0. This corresponds to
a percent bias of -2.4%. As one would expect, both cluster sizes had larger estimated
biases in $\beta$ when the random interaction component was larger. We also see that
the smaller cluster size had smaller estimated biases for $\beta$ than the larger cluster
size. This is due to the difference in treatment sample size between the two cluster
sizes (from 30 to 12). Examining the threshold estimates, we see for the cluster size
of eight that the biases nearly doubled when the variance component for the center
effect was changed from .3 to 1.0. In contrast, the biases remained the same or
decreased for the cluster size of 20. These results indicate that, with a larger cluster
size, the heterogeneity model can more accurately estimate the threshold parameters
even with a large center variance component.
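
The quantities reported in the tables (estimated bias, percent bias, average model-based standard error, and the Monte Carlo standard error) can be computed from a set of simulated estimates as in the following Python sketch. The function name, the true value, and the example numbers are hypothetical; they are not results from our simulations.

```python
from statistics import mean, stdev

def summarize(estimates, model_ses, true_value):
    """Summaries of one simulation run for a single parameter."""
    bias = mean(estimates) - true_value
    mc_se = stdev(estimates)                 # Monte Carlo estimate of the standard error
    return {
        "bias": bias,
        "percent_bias": 100 * bias / true_value,
        "avg_model_se": mean(model_ses),     # average of the information-based SEs
        "mc_se": mc_se,
        "mc_error_of_bias": mc_se / len(estimates) ** 0.5,
    }

# Hypothetical estimates from five simulated datasets (illustration only)
estimates = [0.9, 1.1, 1.0, 1.2, 0.8]
model_ses = [0.15, 0.16, 0.14, 0.17, 0.15]
result = summarize(estimates, model_ses, true_value=1.0)
print(result)
```

The last entry is the Monte Carlo error of the estimated bias, which shrinks with the square root of the number of simulated datasets; this is the quantity that the simulation sizes of 1,000 and 750 were chosen to control.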

We can also see some clear patterns from the biases in the variance components.
First, the estimated biases for all variance components are smaller when the largest
variation comes from the treatment-by-center interaction. In these simulations, the
estimated bias in the center variance component was approximately -.01 (or -4% bias),
regardless of cluster size. The interaction variance component exhibited greater bias,
with the eight cluster size having a -13% bias. It is interesting that the 20 cluster size
had only a -.5% bias for the interaction variance component (explainable by Monte
Carlo error), but had a larger bias in the estimate for $\beta$. The larger bias in the
estimate for $\beta$ is probably due to the smaller treatment sample size. When the larger
variance component was associated with the center effect, there was considerable
bias in the variance component estimates. Indeed, for these cases the estimates of
the covariance matrix are very poor. We suspect that the large variation in the center
effect swamped the interaction variance component, causing difficulties in estimation
of the variance components. In light of these poor estimates, the standard error
estimates for $\beta$ based on the observed information matrix were reasonable. For all
simulations these estimates were lower than the corresponding Monte Carlo estimates.
The largest differences occurred when the interaction variance component was large.

Table 5.5 contains the results for the simulations in which the random effects were
simulated from the distribution given in Figure 5.1. Bias estimates for $\beta$ are similar
to those in Table 5.4, the exception being the eight cluster size with (1, .3), which
had an average bias of .03 or about 4%. It is also interesting for that setting that
the threshold biases are quite small, even though the center variance component was
large. In contrast, the eight cluster size with a center variance component of .3 had
considerably more bias. It is difficult to explain this occurrence. We suspect it is a
combination of the small center size and the form of the bivariate normal mixture.
Only eight observations are selected from the mixture for each dataset, so it is quite
possible to select most observations from one of the bivariate normal components. One
of the bivariate normal components has a small variance component for center (2/3),
which would result in reduced biases for the thresholds as seen in Table 5.5. For
the cluster size of 20, the patterns in Table 5.4 held, with decreasing biases seen in
both the threshold parameters and association parameter when moving from the (.3,
1) case to the (1, .3) case. The biases in this table are generally greater, however. We
again see considerable biases in the variance components. For this table, all biases
were large, regardless of the cluster size or variance component values. Surprisingly,
the standard error estimates are still in close agreement under the mixture model.

From  these  simulations  we  see  that  one  can  obtain  fairly  accurate  estimates  of 
the  association  parameter  and  its  standard  error,  even  when  the  cluster  size  is  small, 
or  the  treatment  sample  size  is  small.  Estimates  for  the  variance  components  can 
be  much  less  accurate,  especially  if  the  random  effects  distribution  deviates  from 
normality.  For  extremely  small  cluster  sizes  or  extremely  sparse  data,  we  would 
expect  the  results  to  continue  to  deteriorate.  Thus  caution  must  be  taken  when 
interpreting estimates such as the variance components. Though not examined here,
one can also expect predictions of the center effects and center-specific cumulative log
odds ratios to be poor, as they are based on the estimates of the parameters,
and in particular the random effects distribution estimates.




Table 5.4: Estimated bias of parameter estimates for the parametric heterogeneous
random effects model (5.12) using simulated multi-center clinical trial data with
treatment sample sizes of $n_{ij}$. The random effects were simulated from a bivariate normal
distribution with covariance structure $(\sigma_1^2, \sigma_2^2)$, and covariance term $\sigma_{12} = 0$.

                       8 CLUSTERS ($n_{ij} = 30$)     20 CLUSTERS ($n_{ij} = 12$)
                       (.3, 1)      (1, .3)         (.3, 1)      (1, .3)

$\alpha_1$              0.013        0.022           0.009        0.008
$\alpha_2$              0.016        0.026           0.007        0.000
$\beta$                -0.003        0.001          -0.018       -0.007
(SE($\beta$)$_O$)      (.673)       (.431)          (.468)       (.302)
(SE($\beta$)$_{MC}$)   (.733)       (.419)          (.478)       (.307)
$\sigma_1^2$           -0.013        0.590          -0.011        0.662
$\sigma_2^2$           -0.130       -0.721          -0.005       -0.697
$\sigma_{12}$           0.012        0.013          -0.007       -0.002

Table 5.5: Estimated bias of parameter estimates for the parametric heterogeneous
random effects model (5.12) using simulated multi-center clinical trial data with
treatment sample sizes of $n_{ij}$. The random effects were simulated from a mixture of two
bivariate normal distributions with covariance structure $(\sigma_1^2, \sigma_2^2)$, and covariance term
$\sigma_{12} = 0$ (see Figure 5.1).

                       8 CLUSTERS ($n_{ij} = 30$)     20 CLUSTERS ($n_{ij} = 12$)
                       (.3, 1)      (1, .3)         (.3, 1)      (1, .3)

$\alpha_1$              0.162       -0.001           0.023        0.011
$\alpha_2$             -0.125       -0.001           0.033        0.014
$\beta$                -0.003        0.029          -0.015        0.001
(SE($\beta$)$_O$)      (.386)       (.425)          (.285)       (.306)
(SE($\beta$)$_{MC}$)   (.382)       (.429)          (.291)       (.304)
$\sigma_1^2$            0.911        0.578           0.941        0.673
$\sigma_2^2$           -0.741       -0.698          -0.771       -0.698
$\sigma_{12}$           0.000       -0.014           0.043       -0.005



5.5  Score  Tests  for  a Common  Association  Parameter 

The assumption that the association parameter is exactly the same across all
centers is generally unrealistic. The heterogeneous association model (5.7) is a
straightforward extension of the homogeneous model that relaxes this assumption. It typically
provides larger standard error estimates for the association parameter than the
homogeneous model, reflecting the uncertainty in the assumption of homogeneity. For
this reason, it may be recommended over the homogeneous model even when one
might expect the heterogeneity to be small. Even so, it would be desirable to have
a statistical test for determining if the common association assumption holds.
Certainly it is more desirable for a pharmaceutical company to know that a particular
drug behaves the same at all centers, as well as for them to be able to report a single
estimate of the association between the drug and the placebo. Thus, in this section
we consider score tests for testing that a common association holds across all centers.
In terms of the heterogeneous association model (5.7), this implies that the variance
component for the interaction random effect $v_i$ is zero, as well as the covariance term
between $u_i$ and $v_i$. As noted in Section 3.4.2, testing that a variance component is
zero involves a nonstandard condition in that the parameter is on the boundary of
the parameter space. Thus the usual likelihood-ratio and Wald tests are not appropriate,
which is why we turn to score tests for this hypothesis.

Before reviewing some of the recent proposals for testing for zero variance
components in random effects models, we mention that tests of homogeneity exist for the
fixed effects heterogeneity model (5.2) as well, treating the number of centers as fixed.
The common way of testing for a homogeneous association is through the use of the
likelihood-ratio test comparing model (5.2) to model (5.1). As the difference in the
number of parameters between the two models is $n - 1$, under the null hypothesis of
no interaction this test follows a $\chi^2_{n-1}$ distribution. Recently, Uesaka (1993) proposed
a measure of interaction between treatment and stratum when the response variable
is ordinal and there are only two treatments. Along with the proposed measure, he
provided a test of the hypothesis of no interaction. In his proposals, Uesaka (1993)
assumed that the strata represented a fixed factor, such as gender.

There have been a number of proposals for testing that the variance component
in a generalized linear model with a single random effect is zero. Jacqmin-Gadda and
Commenges (1995), who extended the work of Liang (1987), proposed a score test
for testing homogeneity among clustered data adjusting for the effects of covariates.
Their test was restricted to canonical generalized linear models that, under the
alternative hypothesis, included only a random intercept. More recently, Lin (1997)
proposed a global score test for testing that all variance components in a generalized
linear mixed model are zero. The test was derived under the assumption that the
random effects were independent. Her test could easily be extended to the multivariate
generalized linear models considered here; however, we are interested in testing
that individual variance components are zero.

Very little work has been done in the area of testing that a single variance
component is zero in the presence of other random effects. In addition to her global score
test, Lin (1997) proposed an individual score test, and its approximation, for testing
a single variance component to be zero. As in the global test, she assumed that all
of the random effects were independent. Since, even under the null hypothesis, the
score statistic does not have a closed form, Lin (1997) utilized the Laplace method to
approximate the intractable integrals. An advantage of this approach is that the
components of the score statistic can be obtained from fitting the null hypothesis model
(i.e., the model without the random effect having a zero variance) using a penalized
quasi-likelihood (or pseudo-likelihood) method such as that by Breslow and Clayton
(1993) or Wolfinger and O’Connell (1993). Lin (1997) reported, however, that the
approximate score test had similar problems with small binomial sample sizes as the
penalized quasi-likelihood estimation procedure.




As  we  have  already  generalized  the  pseudo-likelihood  method  of  Wolfinger  and 
O’Connell  (1993)  in  Section  3.5  to  multinomial  random  effects  models,  generalizing 
the  individual  score  test  of  Lin  (1997)  is  straightforward.  Thus  we  begin  in  Section 
5.5.1  by  outlining  the  Laplace  approximated  score  test  of  Lin  (1997)  for  multinomial 
random  effects  models.  As  noted  before,  this  test  assumes  that  all  random  effects  are 
independent. It also has been shown to perform poorly with small binomial sample
sizes, and we suspect that similar behavior could occur with small multinomial sample
sizes.  Thus,  in  Section  5.5.2  we  consider  a second  score  test  for  testing  that  a variance 
component  (or  a subset  of  the  variance  components)  is  zero.  In  this  test  we  allow  for 
correlated  random  effects,  and  approximate  the  intractable  integrals  using  adaptive 
Gauss-Hermite  quadrature.  A disadvantage  of  this  approach  is  that  verification  of  the 
null  hypothesis  distribution  is  difficult  due  to  the  quadrature  approximation.  Thus  in 
Section  5.5.3  we  examine  the  performance  of  the  proposed  test  through  simulation. 
In  both  Sections  5.5.1  and  5.5.2  we  motivate  the  tests  for  the  general  multinomial 
random  effects  model. 

5.5.1  Laplace  Approximated  Score  Test 

Consider the complete multinomial random effects model

$$\eta = Z\beta + Wu, \qquad (5.14)$$

where $Z = [Z_{ij}]$, $W = \mathrm{diag}(W_i)$ and $u = [u_i]$. For the Laplace approximated score
test, we assume that the random effects are independent and that $u_i' = (u_{i1}, \ldots, u_{im})$
follows a multivariate normal distribution with mean 0 and covariance matrix $\Sigma =
\mathrm{diag}(\sigma_c^2)$, $c = 1, \ldots, m$. We are interested in testing the null hypothesis

$$H_0: \sigma_c^2 = 0 \quad \text{versus} \quad H_a: \sigma_c^2 > 0. \qquad (5.15)$$




In the following, we partition the complete random effects vector $u$ into $u = (u^{-c}, u^{c})$,
where $u^{c} = (u_{1c}, \ldots, u_{nc})$ contains the random effect corresponding to the zero
variance component for all $n$ subjects, and $u^{-c}$ contains the remaining $m - 1$ random
effects for all $n$ subjects. Likewise, a matrix or vector superscripted by $-c$ is formed
or calculated under the null hypothesis. Thus $W^{-c}$ denotes the complete random
effects design matrix with the columns pertaining to $u^{c}$ removed.

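Following Lin (1997), we write the marginal log-likelihood of model (5.14) as

$$l(\beta, \Sigma) = \log \int \exp\{\, l(y; u^{c}) + \log g_{MVN}(u^{c}; 0, \sigma_c^2) \,\}\, du^{c}, \qquad (5.16)$$

where

$$l(y; u^{c}) = \log \int \exp\{\, \log f(y \mid \beta; u) + \log g_{MVN}(u^{-c}; 0, \Sigma^{-c}) \,\}\, du^{-c}.$$

We proceed by calculating the score statistic $s_{\sigma_c^2}(\theta^{-c}) = \partial l(\beta, \Sigma) / \partial \sigma_c^2$, where
$\theta^{-c} = (\beta', \mathrm{vech}(\Sigma^{-c})')'$. Due to the intractable integrals in (5.16) we take a Laplace expansion
of $l(\beta, \Sigma)$ about $\sigma_c^2 = 0$. This is equivalent to expanding $l(y; u^{c}) + \log g_{MVN}(u^{c}; 0, \sigma_c^2)$
about the true mean of the random effect, $u^{c} = 0$, and then integrating term by term.
A two-term Taylor expansion yields

$$\exp\{ l(y; u^{c}) + \log g_{MVN}(u^{c}; 0, \sigma_c^2) \} = \exp\{ l(y; 0) \} \left( 1 + \frac{\partial l(y; 0)}{\partial u^{c\prime}} u^{c}
+ \frac{1}{2} u^{c\prime} \left[ \frac{\partial l(y; 0)}{\partial u^{c}} \frac{\partial l(y; 0)}{\partial u^{c\prime}} + \frac{\partial^2 l(y; 0)}{\partial u^{c} \, \partial u^{c\prime}} \right] u^{c} + \epsilon \right), \qquad (5.17)$$

where $\epsilon$ contains third and higher order terms of $u^{c}$. Denoting the subset of the random
effects design matrix associated with $u^{c}$ by $W^{c}$, we have that

$$\frac{\partial l(y; u^{c})}{\partial u^{c}} = W^{c\prime} \frac{\partial l(y; u^{c})}{\partial \eta}, \quad \text{and} \quad
\frac{\partial^2 l(y; u^{c})}{\partial u^{c} \, \partial u^{c\prime}} = W^{c\prime} \frac{\partial^2 l(y; u^{c})}{\partial \eta \, \partial \eta'} W^{c}. \qquad (5.18)$$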


231 


Also  note  that 


(5.19) 


(5.20) 


where $D^{-c} = \mathrm{diag}(D_{ij}^{-c})$, $R_{\pi}^{-c} = \mathrm{diag}(R_{\pi_{ij}}^{-c})$, and $O^{-c} = \mathrm{diag}(O_{ij}^{-c})$ are calculated
from the null hypothesis model (see Section 2.3 for definitions of $D_{ij}$, $R_{\pi_{ij}}$, and $O_{ij}$).

Integrating (5.17) under the moment assumptions on $u^{-c}$ and then applying to
(5.16) using (5.18) - (5.20), one can obtain the following expression for the score
statistic

$$\ldots \; \mathrm{tr}\left[ W^{c\prime} \left( O^{-c} - D^{-c} (R_{\pi}^{-c})^{-1} D^{-c\prime} \right) W^{c} \right] \Big\}. \qquad (5.21)$$

To evaluate (5.21), one would plug in the estimates of $\theta^{-c}$ obtained under the null
hypothesis model. These estimates can be obtained by fitting the reduced multinomial
random effects model that omits the random effect $u^{c}$. For this test, we will use
the restricted maximum likelihood estimates obtained from the pseudo-likelihood
approach of Wolfinger and O’Connell (1993). Letting

(5.22)

where $l$ denotes the marginal log-likelihood (5.16), the score statistic for testing (5.15)
is given by

(5.23)




Note that (5.23) is calculated under $\sigma_c^2 = 0$. Under the null hypothesis, it can be
shown that $\chi_s$ asymptotically follows a standard normal distribution. Here the
asymptotics refer to the number of clusters going to infinity while the number of
observations on each cluster remains bounded. The proof of this for the multinomial
random effects models follows directly from Lin (1997, see text and Appendix 2).

The score test in (5.23) contains intractable integrals in both the numerator and
denominator. Lin (1997) used the Laplace method to approximate these sets of
intractable integrals. This parallels what was done in (5.17), with $\exp\{ l(y; u^{c}) +
\log g_{MVN}(u^{c}; 0, \sigma_c^2) \}$ replaced by $\exp\{ l(y; u^{-c}) + \log g_{MVN}(u^{-c}; 0, \Sigma^{-c}) \}$ and the
expansion taken about $u^{-c} = \hat{u}^{-c}$, where $\hat{u}^{-c}$ denotes the maximum point of $l(y; u^{-c}) +
\log g_{MVN}(u^{-c}; 0, \Sigma^{-c})$.

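Following her derivation exactly, it can be shown that the Laplace approximated
score statistic (5.21), evaluated under the null hypothesis and at the restricted
maximum likelihood estimates $\hat{\theta}^{-c}$, is

$$s_{\sigma_c^2}(\hat{\theta}^{-c}) = \frac{1}{2} \left\{ (y^{-c} - Z\hat{\beta})' (V^{-c})^{-1} W^{c} W^{c\prime} (V^{-c})^{-1} (y^{-c} - Z\hat{\beta})
- \mathrm{tr}\left( W^{c\prime} P^{-c} W^{c} \right) \right\}, \qquad (5.24)$$

where $y$ and $V$ are defined as in Section 3.5.2 with $\phi$ set to 1.0 and $R$ equal to the
identity matrix, and

$$P^{-c} = (V^{-c})^{-1} - (V^{-c})^{-1} Z \left( Z' (V^{-c})^{-1} Z \right)^{-1} Z' (V^{-c})^{-1}.$$

The denominator of (5.23) is then approximated by taking the expected value of the
square of (5.24). Let

$$M_{ll'} = \frac{1}{2} \mathrm{tr}\left( P^{-c} W_l W_l' P^{-c} W_{l'} W_{l'}' \right), \qquad l, l' = 1, \ldots, m,$$

where $W_l$ denotes the columns of $W$ pertaining to the $l$th random effect. Then the
denominator is approximated by

$$F_{\sigma_c^2 \sigma_c^2} - F_{\sigma_c^2 \sigma^{-c}}' \, F_{\sigma^{-c} \sigma^{-c}}^{-1} \, F_{\sigma_c^2 \sigma^{-c}},$$

where $F_{\sigma_c^2 \sigma_c^2} = M_{cc}$, $F_{\sigma_c^2 \sigma^{-c}}$ is an $(m - 1) \times 1$ vector with elements $M_{lc}$ ($l \ne c$), and
$F_{\sigma^{-c} \sigma^{-c}}$ is an $(m - 1) \times (m - 1)$ matrix with elements $M_{ll'}$ ($l, l' \ne c$).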

An advantage of using the restricted maximum likelihood estimators $\hat{\theta}^{-c}$ from the
pseudo-likelihood method is that a number of the components of the approximated
score statistic, such as the pseudo-vector $y^{-c}$, are by-products of the algorithm. Thus
one  can  use  the  pseudo-likelihood  algorithm  defined  in  Section  3.5.2  to  obtain  some  of 
the  quantities  needed  to  calculate  the  approximated  score  statistic.  Simulation  work 
by  Lin  (1997)  showed  that  the  approximated  score  statistic  performed  poorly  when 
the  binomial  sample  sizes  were  small.  Similar  studies  could  also  be  performed  for  the 
multinomial  random  effects  models.  We  suspect  that  similar  behavior  will  occur  for 
these  models  as  well. 


5.5.2  Adaptive  Gauss-Hermite  Quadrature  Approximated  Score  Test 

We now consider testing that a subset of the variance components in the
multinomial random effects model is zero, where the random effects are allowed to be
correlated. For this test we use adaptive Gauss-Hermite quadrature to approximate
the score statistic and the corresponding information matrices. For this derivation
we consider the multinomial random effects model at the center level. That is,

$$\eta_i = Z_i \beta + W_i u_i, \qquad (5.25)$$

where $Z_i = [Z_{ij}]$, $W_i = [W_{ij}]$, and $u_i$ is assumed to be multivariate normal with mean
0 and covariance matrix $\Sigma$. Let $\sigma = \mathrm{vech}(\Sigma)$ be the unique elements of the covariance
matrix $\Sigma$. Using similar notation to the previous section, we are interested in testing
that a subset of these elements, say $\sigma^{c}$, is zero. Thus we can partition $\sigma$ as
$\sigma = (\sigma^{c}, \sigma^{-c})$.


As Chant (1974) showed that the score test retains its asymptotic properties even
when a parameter is on the boundary of the parameter space, a score test for testing
that $H_0: \sigma^{c} = 0$ is given by

$$\lambda_s = s_{\sigma^{c}}' \left[ F_{\sigma^{c} \sigma^{c}} - F_{\sigma^{c} \sigma^{-c}} F_{\sigma^{-c} \sigma^{-c}}^{-1} F_{\sigma^{-c} \sigma^{c}} \right]^{-1} s_{\sigma^{c}}. \qquad (5.26)$$

Under a typical null hypothesis, the distribution of $\lambda_s$ would be a $\chi^2_C$ distribution,
with $C$ being the number of restricted parameters in the null hypothesis. There is
some question, however, if this holds for testing $\sigma^{c} = 0$. Consider a model with
two correlated random effects and an unstructured 2 by 2 covariance matrix. If one
wishes to test that one of the variance components, say $\sigma_2^2$, is zero, then necessarily
the covariance term $\sigma_{12} = 0$. Thus, for the null hypothesis $H_0: \sigma_2^2 = \sigma_{12} = 0$,
one could consider $\sigma_2^2$ as the only free parameter and argue that $\lambda_s$ should be
distributed $\chi^2_1$. However, this would mean that the distribution of the score test for
testing $\sigma_2^2 = 0$ when the random effects are correlated and when they are uncorrelated
would be the same. The distribution is $\chi^2_1$ for the latter model since $\sigma_{12}$ is zero under
both hypotheses. We were unable to find any literature in which this situation has
been addressed. We feel, as the simulation study of the next section seems to suggest,
that the distribution of $\lambda_s$ is $\chi^2_C$ where $C$ is the dimension of $\sigma^{c}$.

In (5.26), $s_{\sigma^{c}}$ is the derivative of the marginal log-likelihood for the complete
model with respect to $\sigma^{c}$, evaluated under the null hypothesis model maximum
likelihood estimates and at $\sigma^{c} = 0$. The components, $F_{\cdot\cdot}$, of the information matrix $F$
from the alternative hypothesis model are similarly evaluated. Unfortunately, the
elements of (5.26) do not have closed form, and must be approximated. In theory, if one
can approximate the intractable integrals in (5.26) accurately, then the asymptotic
distribution of the test statistic should hold.
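
Once the score vector and the information matrix have been approximated, the quadratic form in (5.26) is simple to evaluate. The following Python sketch illustrates the calculation; it is not the Ox code used in this work. The helper names and the numerical values of the score and information matrix are hypothetical, and in this sketch every parameter that is not being tested is treated as a nuisance parameter.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def score_statistic(s_c, F, idx_c):
    """s_c' [F_cc - F_cn F_nn^{-1} F_nc]^{-1} s_c, with idx_c the tested parameters."""
    idx_n = [k for k in range(len(F)) if k not in idx_c]
    F_nn = [[F[r][c] for c in idx_n] for r in idx_n]
    # cols[b] holds F_nn^{-1} times the column of F_nc for tested parameter b
    cols = [solve(F_nn, [F[r][c] for r in idx_n]) for c in idx_c]
    eff = [[F[a][b] - sum(F[a][idx_n[k]] * cols[j][k] for k in range(len(idx_n)))
            for j, b in enumerate(idx_c)] for a in idx_c]
    x = solve(eff, s_c)
    return sum(si * xi for si, xi in zip(s_c, x))

# Hypothetical information matrix for (beta, sigma_1^2, sigma_2^2) and a
# hypothetical score for the tested component sigma_2^2 (index 2)
F = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
stat = score_statistic([1.5], F, idx_c=[2])
print(stat)
```

The resulting value would be compared with a $\chi^2_C$ critical value; for $C = 1$ and a 0.05 level this is 3.841.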

Direct calculation of (5.26) for most variance component tests is not possible. The
reason is that under the null hypothesis certain elements of the covariance matrix $\Sigma$
will be zero. Only for special tests (such as testing that a covariance term is zero)
will this not lead to a singular matrix. To avoid singular matrices, one can use a
conditional approach in which the marginal likelihood is written conditionally upon
the random effects vector associated with the nonzero variances. Specifically, we
partition the random effects vector for the $i$th center into $u_i = (u_i^{c}, u_i^{-c})$, where $u_i^{c}$
is the random effects vector corresponding to the zero variance components, and the
covariance matrix into

$$\Sigma = \begin{pmatrix} \Sigma^{c,c} & \Sigma^{c,-c} \\ \Sigma^{-c,c} & \Sigma^{-c,-c} \end{pmatrix}.$$

Also, for notational convenience, we denote the conditional log-likelihood for the $i$th
center as $l_i(y_i \mid u_i)$.

Since we have assumed that $u_i$ has a multivariate normal distribution, the densities
$g_{u_i^{c} \mid u_i^{-c}}$ and $g_{u_i^{-c}}$ are

$$\mathrm{MVN}[\, \Sigma^{c,-c} (\Sigma^{-c,-c})^{-1} u_i^{-c}, \; \Sigma^{*} \,] \quad \text{and} \quad \mathrm{MVN}[\, 0, \; \Sigma^{-c,-c} \,],$$

respectively, where $\Sigma^{*} = \Sigma^{c,c} - \Sigma^{c,-c} (\Sigma^{-c,-c})^{-1} \Sigma^{-c,c}$ (see, e.g., Johnson 1987, p. 50).

Then the marginal likelihood for the $i$th subject can be written

$$f(y_i \mid \beta, \sigma) = \int \!\! \int \exp(l_i(y_i \mid u_i)) \, g_{u_i^{c} \mid u_i^{-c}}(u_i^{c}) \, g_{u_i^{-c}}(u_i^{-c}) \, du_i^{c} \, du_i^{-c}. \qquad (5.27)$$




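To avoid singular matrices under the null hypothesis, we continue by expanding
$\exp(l_i(y_i \mid u_i))$ in (5.27) in a Taylor series expansion about $\bar{u}_i^{c} = \Sigma^{c,-c} (\Sigma^{-c,-c})^{-1} u_i^{-c}$
(i.e., the conditional mean of $u_i^{c}$). That is,

$$\exp(l_i(y_i \mid u_i)) = \exp(l_i(y_i \mid \tilde{u}_i)) \left( 1 + \frac{\partial l_i(y; \tilde{u}_i)}{\partial u_i^{c\prime}} (u_i^{c} - \bar{u}_i^{c})
+ \frac{1}{2} (u_i^{c} - \bar{u}_i^{c})' \left[ \frac{\partial l_i(y; \tilde{u}_i)}{\partial u_i^{c}} \frac{\partial l_i(y; \tilde{u}_i)}{\partial u_i^{c\prime}}
+ \frac{\partial^2 l_i(y; \tilde{u}_i)}{\partial u_i^{c} \, \partial u_i^{c\prime}} \right] (u_i^{c} - \bar{u}_i^{c}) + \epsilon_i \right), \qquad (5.28)$$

where $\tilde{u}_i = (\bar{u}_i^{c}, u_i^{-c})$. By using the conditional moment assumptions for $u_i^{c}$, the
derivative formulas given in (5.18), and integrating (5.28) with respect to $g_{u_i^{c} \mid u_i^{-c}}$, one
obtains the marginal likelihood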


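$$f(y_i \mid \beta, \sigma) = \int \exp(l_i(y_i \mid \tilde{u}_i)) \left\{ 1 + \frac{1}{2} \mathrm{tr}\left( W_i^{c\prime} \left[ \frac{\partial l_i(y; \tilde{u}_i)}{\partial \eta_i} \frac{\partial l_i(y; \tilde{u}_i)}{\partial \eta_i'}
+ \frac{\partial^2 l_i(y; \tilde{u}_i)}{\partial \eta_i \, \partial \eta_i'} \right] W_i^{c} \, \Sigma^{*} \right) + \epsilon_i \right\} g_{u_i^{-c}}(u_i^{-c}) \, du_i^{-c}, \qquad (5.29)$$

where $\epsilon_i$ depends on second and higher products of the variance components. Under
the null hypothesis these terms are zero, and so are ignored. For the multinomial
random effects models, the trace term in (5.29) can be rewritten as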


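$$\left[ \sum_{j=1}^{T_i} (y_{ij} - \pi_{ij})' R_{\pi_{ij}}^{-1} D_{ij}' W_{ij}^{c} \right] \Sigma^{*} \left[ \sum_{j=1}^{T_i} W_{ij}^{c\prime} D_{ij} R_{\pi_{ij}}^{-1} (y_{ij} - \pi_{ij}) \right]
+ \sum_{j=1}^{T_i} \mathrm{tr}\left[ W_{ij}^{c\prime} \left( O_{ij} - D_{ij} R_{\pi_{ij}}^{-1} D_{ij}' \right) W_{ij}^{c} \, \Sigma^{*} \right], \qquad (5.30)$$

where all elements are calculated using $\tilde{u}_i$.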
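
For the bivariate case with a scalar $u_i^{c}$ and a scalar $u_i^{-c}$, the conditional mean and the conditional variance $\Sigma^{*}$ used in this derivation reduce to simple ratios. The Python sketch below illustrates them; the function name and the numerical values are hypothetical. It also shows that under the null hypothesis only $\Sigma^{-c,-c}$ is inverted, so no singular matrix arises.

```python
def conditional_moments(s_cc, s_cn, s_nn, u_n):
    """Conditional mean and variance of u^c given u^{-c} = u_n (both scalar).

    s_cc: variance of u^c, s_cn: covariance, s_nn: variance of u^{-c} (must be > 0).
    """
    mean_c = s_cn / s_nn * u_n        # Sigma^{c,-c} (Sigma^{-c,-c})^{-1} u^{-c}
    var_c = s_cc - s_cn ** 2 / s_nn   # Sigma^* = Sigma^{c,c} - Sigma^{c,-c} (...)^{-1} Sigma^{-c,c}
    return mean_c, var_c

# Hypothetical values under the alternative
alt = conditional_moments(s_cc=0.3, s_cn=0.1, s_nn=1.0, u_n=2.0)

# Under the null hypothesis the tested variance and the covariance are both zero;
# the formulas remain well defined because only s_nn is inverted.
null = conditional_moments(s_cc=0.0, s_cn=0.0, s_nn=1.0, u_n=2.0)
print(alt, null)
```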

We now have a representation of the marginal likelihood that can be used in
the score test (5.26) since, under the null hypothesis, the derivatives of the log of
(5.29) do not contain singular covariance matrices. The score test (5.26) requires
the first derivatives of (5.29) with respect to $\sigma^{c}$, as well as the second derivative
matrix with respect to all of the parameters $(\beta, \sigma)$. To complicate matters, the
marginal likelihood (5.29) contains a possibly multi-dimensional intractable integral.
For the intractable integrals one can use Monte Carlo or quadrature techniques for
obtaining approximations. We consider the latter approach here. For the derivatives
one could calculate analytical derivatives or use numerical derivatives. Analytical
derivatives are extremely complicated. Note that all of the elements in (5.29) and
(5.30) are calculated using $\tilde{u}_i$, which is made up of $\Sigma^{c,-c} (\Sigma^{-c,-c})^{-1} u_i^{-c}$. This means
that the elements of the covariance matrix are not localized to the multivariate normal
density term, as they are in the general multinomial random effects model. So use
of standard formulas for derivatives in multivariate generalized linear models is no
longer possible. For this reason we utilize numerical derivatives for evaluating (5.26).
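
Numerical derivatives of the kind referred to above can be sketched with finite differences. The Python fragment below is an illustration only, not the routine used for our calculations. The function names, step sizes, and the test function are hypothetical. Central differences step on both sides of the evaluation point, so for a parameter sitting exactly on the boundary a one-sided (forward) difference may be the safer choice; that option is included as an assumption about how one might handle the boundary.

```python
def gradient(f, x, h=1e-5):
    """Central-difference gradient of f at the point x (a list of floats)."""
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def forward_derivative(f, x, k, h=1e-6):
    """One-sided derivative in coordinate k, which never steps below x[k]."""
    xp = list(x)
    xp[k] += h
    return (f(xp) - f(x)) / h

def hessian(f, x, h=1e-4):
    """Central-difference matrix of second derivatives of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            def shifted(da, db):
                y = list(x)
                y[a] += da
                y[b] += db
                return f(y)
            H[a][b] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
            H[b][a] = H[a][b]
    return H

# Hypothetical smooth test function standing in for the log of (5.29)
def f(x):
    return x[0] ** 2 * x[1] + 3 * x[1] ** 2

g = gradient(f, [1.0, 2.0])
H = hessian(f, [1.0, 2.0])
d = forward_derivative(f, [1.0, 0.0], k=1)
print(g, H, d)
```

Each gradient entry costs two function evaluations and each Hessian entry four, which is why the cost of evaluating the approximated likelihood dominates the calculation.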

We use the following approach to calculate the score statistic (5.26). First, we
obtain maximum likelihood estimates of $\beta$ and $\sigma_{-c}$ under the null hypothesis model
using the adaptive Gauss-Hermite algorithm proposed in Chapter 3. We then calculate
numerical first and second derivatives of the log of (5.29) with respect to $\sigma_c$ and
$(\beta, \sigma)$, respectively, each evaluated at $\hat{\beta}$, $\hat{\sigma}_{-c}$, and $\sigma_c = 0$. To calculate the numerical
derivatives, we must numerically approximate the integrals in (5.29). To utilize adaptive
Gauss-Hermite quadrature, we first calculate the mode of the integrand in (5.29)
as a function of $u_i$, and then the curvature of the integrand, evaluated at the mode.
Using these estimates we center and scale the standard nodes from Gauss-Hermite
quadrature and evaluate (5.29) with the adapted nodes. This must be performed
for each center. The number of quadrature points needed to approximate (5.27) will
depend on the dimension of the integral under the null hypothesis. For testing the
homogeneity assumption for a dataset such as Table 5.1 (where the integral is
one-dimensional under the null hypothesis) we found the score test value to be accurate to
approximately four decimal places using 15 quadrature points. Using the numerical
derivatives we then calculate the elements in (5.26) and compare the resulting test
statistic value to the appropriate $\chi^2$ value.
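
The centering and scaling of the Gauss-Hermite nodes described above can be sketched in a few lines. The following is only an illustrative one-dimensional version, not the Ox code used for this work: the 5-point rule is hard-coded from standard tables (the text uses 15 points), the mode is found by a Newton search with central differences, and `center_log_integrand` is a hypothetical binary-logit center with a normal random intercept rather than the multinomial integrand of (5.29). The integrand is assumed to be log-concave.

```python
import math

# Tabulated 5-point Gauss-Hermite rule for the weight function exp(-x^2).
GH_NODES = [-2.0201828704560856, -0.9585724646138185, 0.0,
            0.9585724646138185, 2.0201828704560856]
GH_WEIGHTS = [0.01995324205904591, 0.3936193231522412, 0.9453087204829419,
              0.3936193231522412, 0.01995324205904591]


def second_difference(log_g, u, eps):
    """Central-difference estimate of the second derivative of log_g at u."""
    return (log_g(u + eps) - 2.0 * log_g(u) + log_g(u - eps)) / eps ** 2


def find_mode(log_g, start=0.0, eps=1e-4, tol=1e-8, max_iter=100):
    """Newton search for the mode of a log-concave integrand.

    Returns the mode and the scale 1 / sqrt(-curvature) at the mode.
    """
    u = start
    for _ in range(max_iter):
        slope = (log_g(u + eps) - log_g(u - eps)) / (2.0 * eps)
        step = slope / second_difference(log_g, u, eps)
        u -= step
        if abs(step) < tol:
            break
    return u, 1.0 / math.sqrt(-second_difference(log_g, u, eps))


def adaptive_gauss_hermite(log_g, start=0.0):
    """Approximate the integral of exp(log_g(u)) over the real line."""
    mode, scale = find_mode(log_g, start)
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        u = mode + math.sqrt(2.0) * scale * x  # centered and scaled node
        total += w * math.exp(x * x + log_g(u))
    return math.sqrt(2.0) * scale * total


def center_log_integrand(u, successes=12, trials=30, intercept=0.0, var_u=0.5):
    """Hypothetical center: binomial-logit kernel times the N(0, var_u) density."""
    eta = intercept + u
    log_p = -math.log1p(math.exp(-eta))
    log_q = -math.log1p(math.exp(eta))
    log_prior = -0.5 * u * u / var_u - 0.5 * math.log(2.0 * math.pi * var_u)
    return successes * log_p + (trials - successes) * log_q + log_prior


center_contribution = adaptive_gauss_hermite(center_log_integrand)
```

When the integrand is exactly proportional to a normal density, the adapted rule reproduces the integral up to rounding, which is why far fewer nodes are needed than with nodes centered at zero.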




We  close  this  section  with  a number  of  comments  concerning  the  programming 
aspect  of  the  score  test.  As  with  the  algorithms  discussed  in  Chapters  3 and  4,  we  used 
the matrix programming language Ox to approximate the score test and to perform
the  simulation  studies  discussed  in  the  next  section.  Due  to  the  evaluation  of  the 
marginal  likelihood  (5.29)  (and  its  derivatives)  at  parameter  values  on  the  boundary 
of  the  parameter  space,  there  were  times  when  certain  matrices,  such  as  Rn{j  , became 
noninvertible during the calculation of the score statistic. This mainly occurred when
we were searching for the mode of the integrand of (5.29). To avoid having the
maximization routine stop due to a function evaluation failure, one should check for
such situations and alert the routine to move on to a new search value. We have not
encountered a situation in which the final estimate of a mode yielded noninvertible
matrices.  Algorithms  for  numerical  derivatives  must  repeatedly  evaluate  the  function 
that  is  being  differentiated.  As  each  evaluation  of  the  function  requires  n (multiple) 
integral  approximations,  calculation  of  the  score  test  can  require  a large  amount  of 
computational  time.  In  addition,  one  must  find  the  maximum  likelihood  estimates 
under  the  null  hypothesis  model  prior  to  calculating  the  score  statistic.  For  tests 
in  which  the  null  hypothesis  model  has  multiple  random  effects,  this  alone  can  take 
considerable  time. 
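
One way to keep a search routine from stopping on a failed function evaluation is to wrap the objective so that a failure is reported as an infinitely bad value. The sketch below is a hypothetical illustration of that idea in Python, not the Ox routine used here; `safe_value` and `robust_minimize` are names invented for this example, and the search is a simple shrinking grid rather than a production optimizer. With a matrix library, the exception raised for a singular matrix would be added to the caught exceptions.

```python
import math


def safe_value(objective, u):
    """Return objective(u), or +inf when the evaluation fails or is not finite."""
    try:
        value = objective(u)
    except (ArithmeticError, ValueError):
        # e.g. division by zero or log of a non-positive number
        return math.inf
    return value if math.isfinite(value) else math.inf


def robust_minimize(objective, lo, hi, n_grid=21, n_rounds=8):
    """Shrinking grid search that skips trial points where evaluation fails."""
    best_u, best_val = None, math.inf
    for _ in range(n_rounds):
        step = (hi - lo) / (n_grid - 1)
        for k in range(n_grid):
            u = lo + k * step
            val = safe_value(objective, u)
            if val < best_val:
                best_u, best_val = u, val
        if best_u is None:
            raise RuntimeError("objective failed at every trial point")
        lo, hi = best_u - step, best_u + step  # zoom in around the best point
    return best_u, best_val


def example_objective(u):
    """Toy objective that cannot be evaluated for u <= 0."""
    return (u - 1.5) ** 2 - math.log(u)


mode_estimate, _ = robust_minimize(example_objective, -2.0, 4.0)
```

The search starts on an interval that includes invalid points, yet it proceeds because those points are simply ranked last.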

5.5.3  Simulation  Study 

We now examine the use of the adaptive Gauss-Hermite quadrature approximated
score test for the testing of a common association parameter in the heterogeneous
random effects model (5.7). We have already seen in Section 5.4 that the heterogeneous
random effects model provides poor estimates of the random effects distribution
when the number of centers is small. Thus, it may be overly ambitious to test that
the interaction variance component and the covariance term between the center and
center-by-treatment interaction are zero. We have also seen in the previous simulation
study that the estimates across simulations are quite variable, requiring a large
number of simulations to reduce the Monte Carlo error. Unfortunately, with the
large computational time required to compute each score statistic, we do not have
the luxury of performing 1,000 simulations as in Section 5.4.

To examine the performance of the score test, we studied its behavior under the
null hypothesis. Under the null hypothesis, the heterogeneous random effects model
(5.7) has the interaction variance component and the corresponding covariance term
set to zero. Thus, we simulated from the homogeneous random effects model

$$\eta_{ijr} = \alpha_r + \beta x_j + u_i, \qquad r = 1, \ldots, q = R - 1, \quad i = 1, \ldots, n, \quad j = 1, 2, \tag{5.31}$$

with $R = 3$, $\alpha_1 = -1.25$, $\alpha_2 = 1.25$, and $\beta = .5$. We simulated the random effect $u_i$
from a univariate normal distribution with zero mean and variance equal to .5. The null
hypothesis for testing that a common association parameter holds is
$H_0: \sigma_2^2 = \sigma_{12} = 0$. Under the null hypothesis, the test statistic has a $\chi_2^2$ distribution.
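
A data-generating step of this kind can be sketched as follows. This is an illustrative Python version, not the Ox simulation code. It assumes the cumulative logit form, so that the probability of a response in category $r$ or below is the inverse logit of $\eta_{ijr}$, and it assumes a treatment indicator coded 1 for the active drug and 0 for placebo; the function name and the layout of the returned counts are hypothetical.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_homogeneous(n_centers, n_per_arm, alpha=(-1.25, 1.25), beta=0.5,
                         var_u=0.5, seed=None):
    """Simulate category counts per center and treatment arm from model (5.31).

    Returns a list of (center, x, counts) with counts over the R = 3 categories.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n_centers):
        u = rng.gauss(0.0, math.sqrt(var_u))  # random center effect u_i
        for x in (1, 0):                      # assumed coding: 1 = drug, 0 = placebo
            # Cumulative probabilities P(Y <= r | u_i); the last one is 1.
            cum = [expit(a + beta * x + u) for a in alpha] + [1.0]
            counts = [0] * len(cum)
            for _ in range(n_per_arm):
                draw = rng.random()
                category = next(r for r, c in enumerate(cum) if draw <= c)
                counts[category] += 1
            data.append((i, x, counts))
    return data


sample = simulate_homogeneous(n_centers=8, n_per_arm=30, seed=1)
```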

We ran a number of pilot studies varying the center size $n$ and the treatment
sample size $n_{ij}$, and it became clear that the approximated score test was performing
very poorly, especially with small center sizes. To illustrate this behavior, we ran four
simulations with center sizes of 8, 30, 50, and 75, each having $n_{ij} = 30$. It is obviously
unrealistic that one would have a clinical trial with 50 or 75 centers, or that one would
have 30 patients per treatment at each center. However, we ran the simulations at
these levels to show how the score test did improve with the center size. Table 5.6
reports the rejection rates for the score test at $\alpha$ = 0.01, 0.05, and 0.10. For each center
size we ran 100 simulations. Ideally we should run more simulations as we have seen
that the Monte Carlo error in the parameter estimates can be quite large with small
to moderate center sizes. However, the computational time required to numerically
approximate the first and second derivatives of the marginal log-likelihood is quite
high, and thus prohibitive of large simulation studies.
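
To give a sense of how much 100 simulations can resolve, the Monte Carlo standard error of an estimated rejection rate can be computed from the usual binomial formula. The short Python sketch below is an added illustration; the function name is hypothetical, and the 1,000-run figure is included only for comparison with Section 5.4.

```python
import math


def rejection_rate_se(rate, n_sims):
    """Binomial standard error of a rejection rate estimated from n_sims runs."""
    return math.sqrt(rate * (1.0 - rate) / n_sims)


# Standard errors at the three nominal levels for 100 and for 1,000 simulations.
for nominal in (0.10, 0.05, 0.01):
    print(nominal, rejection_rate_se(nominal, 100), rejection_rate_se(nominal, 1000))
```

With 100 runs the standard error at the 0.05 level is roughly 0.02, so only large departures from the nominal level can be detected reliably.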




Table 5.6: Rejection rates and average score test value for the adaptive Gauss-Hermite
quadrature approximated score test for testing of a common association parameter
in the heterogeneous association model (5.31) with $n_{ij} = 30$.

NUMBER OF     TYPE I $\alpha$ RATE        AVERAGE SCORE
CENTERS       0.10    0.05    0.01        TEST VALUE
    8         0.24    0.21    0.12        2.275
   30         0.25    0.18    0.11        3.093
   50         0.18    0.14    0.04        2.403
   75         0.11    0.08    0.02        1.954

We can clearly see in Table 5.6 that the score test performs very poorly when
the number of centers is small to moderate. Even with a center size of 50, the type I
error rate for the score test exceeded the nominal level by a fair amount. With
a center size of 75, the reported type I rates were close to the nominal levels. The
average test statistic values are also reported. The Monte Carlo errors associated with
these average test statistic values for center sizes 8, 30, 50, and 75 are approximately
1.8, 1.3, .28, and .23, respectively. This illustrates the large variability present in the 8
and 30 center size runs. The large variability with the smaller center sizes did not
improve when we increased the treatment size within each center. For example, with
eight centers and treatment sizes of 100 the rejection rates at $\alpha$ = 0.10, 0.05, and 0.01
were 0.27, 0.18, and 0.11, respectively. As mentioned before, we would expect that
the approximated score test would perform poorly for small numbers of centers due
to the lack of accurate information concerning the variance components. However,
there are a number of other factors relating to the derivation of the test statistic
that may be adversely affecting the computed value. When we examined the actual
test statistic values computed for the simulations with center size 8, we found that
they ranged from -78.78 to 210.61! We found similar results for the center size of 30,
though there were fewer occurrences of the extreme values. Obviously, one or more elements
of the score statistic were being incorrectly estimated, or more accurately, could not
be correctly estimated.




We feel that there are three possible reasons for the failure of the score statistic
in certain situations. First, the failure could be caused by the small center size in
that one is more likely to obtain an odd simulated dataset. Given certain patterns of
the responses across the treatments and centers, singularities or near singularities can
occur in the variance matrix for the response. This would lead to a corruption
of the score statistic as well. Secondly, the erroneous values could be a result of
the Laplace method used in the approximation of the score statistic (see (5.28) and
(5.29)). For approximating the inner integral in (5.27), we used a two-term Taylor
series expansion about the conditional mode of the multivariate normal distribution.
We then integrated these terms with respect to the multivariate normal distribution
and ignored the higher terms. Recall that these higher terms are functions of second
and higher products of variance components which are zero under the null hypothesis.
These terms, however, could contribute to the calculation of the test statistic when
using numerical derivatives to compute the score vector and information matrix. To
see this, we need to consider how numerical derivatives are calculated.

Consider a function $f(x)$ for which the first and second derivatives at $x_0$ are
required. The derivative at $x_0$ can be approximated by computing

$$\left.\frac{\partial f(x)}{\partial x}\right|_{x = x_0} \approx \frac{f(x_0 + \epsilon\iota) - f(x_0 - \epsilon\iota)}{2\epsilon}, \tag{5.32}$$

where $\iota$ is a unit vector (e.g., $(1, 0, \ldots, 0)'$) with the position of the one depending
on the component of $x$ being differentiated. The value $\epsilon$ is a suitably chosen step
length, representing a compromise between round-off error (cancelation of leading
digits when subtracting nearly equal numbers) and truncation error (ignoring terms
of higher order than $\epsilon$ in the approximation). Now consider the situation in the score
test in which derivatives are required for (along with the other parameters) $\sigma_2^2$ and
$\sigma_{12}$ evaluated at zero. We argued that the higher terms $e^*$ in (5.29) would be zero
under the null hypothesis. But from (5.32) one can see that these terms would be
evaluated using numerical derivatives, albeit at some small value $\epsilon$ away from zero. It
is possible that with small numbers of centers, these small changes in the numerically
approximated derivative could alter the computed test statistic value.
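
The central difference in (5.32) is straightforward to code. The Python sketch below is an added illustration with a hypothetical function name and a default step length chosen only for the example. It makes the point of the preceding paragraph concrete: the function is never evaluated at $x_0$ itself, only at points a distance $\epsilon$ on either side of it.

```python
import math


def central_gradient(f, x0, eps=1e-5):
    """Central-difference approximation to the gradient of f at x0."""
    gradient = []
    for k in range(len(x0)):
        up = list(x0)
        down = list(x0)
        up[k] += eps     # x0 + eps * (unit vector for component k)
        down[k] -= eps   # x0 - eps * (unit vector for component k)
        gradient.append((f(up) - f(down)) / (2.0 * eps))
    return gradient


def example_function(x):
    """Toy function with known gradient (exp(x0) * x1, exp(x0))."""
    return math.exp(x[0]) * x[1]


approx_gradient = central_gradient(example_function, [0.0, 2.0])
```

If a component of `x0` is a variance component fixed at zero, the two evaluation points lie at $\pm\epsilon$, so any term that vanishes only at zero still enters the computed derivative.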

Thirdly, the incorrect test statistic values could be caused by a general failure of
the numerical derivatives. It may be that, regardless of including the higher terms
of $e^*$, the numerical derivatives just perform poorly for small center sizes. Ideally,
to check these latter two possibilities, one could include some of the higher terms
in $e^*$ and then use analytical derivatives to compute the score test statistic. As
noted before, analytical derivatives are very difficult to compute for the approximated
score test. Thus we first examined the use of higher terms in the Laplace method,
which we discuss below. We then considered computing analytical derivatives and
were successful using the software package Mathematica (Wolfram 1996). Mathematica
is a symbolic mathematical package that can output derivatives of complicated
functions. We used Mathematica to calculate the first and second derivatives of the
marginal log-likelihood (5.29) and imported them into our Ox program. The resulting
program took approximately seven times longer to run, and the code for the analytical
derivatives took almost 7 MB of file space. Thus, in practice, this approach is
not recommended. We discuss below the differences found between using numerical
derivatives and analytical derivatives for the approximated score test.

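Consider again the expansion of $\exp(l_i(y_i \mid u_i))$ in (5.27) about $\tilde{u}_i^c$, the conditional
mean of $u_i^c$. In (5.28) we considered only the first two terms of this expansion. We now
consider the inclusion of the third- and fourth-degree polynomials of the components
of $u_i^c$. For the test of homogeneity, $u_i^c = v_i^*$ and $u_i^{-c} = u_i^*$, thus the additional terms
are given by

$$T_1 = \frac{1}{6}\,\frac{\partial^3 l_i(y; u_i)}{\partial v_i^{*3}} \left(v_i^* - \frac{\sigma_{12}}{\sigma_1^2}u_i^*\right)^3, \tag{5.33}$$

$$T_2 = \frac{1}{24}\,\frac{\partial^4 l_i(y; u_i)}{\partial v_i^{*4}} \left(v_i^* - \frac{\sigma_{12}}{\sigma_1^2}u_i^*\right)^4. \tag{5.34}$$

Note that $l_i(y; u_i)$ is evaluated at $u_i = \left(\frac{\sigma_{12}}{\sigma_1^2}u_i^*, u_i^*\right)$. As before, we integrate (5.33)
and (5.34) with respect to the conditional density of $v_i^* \mid u_i^*$. Since this density is
univariate normal with mean $\frac{\sigma_{12}}{\sigma_1^2}u_i^*$ and variance $\sigma_2^2 - \frac{\sigma_{12}^2}{\sigma_1^2}$, the integrations of (5.33)
and (5.34) correspond to finding the third and fourth moments of a normal random
variable with mean zero and variance $\sigma_2^2 - \frac{\sigma_{12}^2}{\sigma_1^2}$. Thus the integral of $T_1$ is zero while
the integral of $T_2$ is

$$T_2^* = \frac{1}{8}\,\frac{\partial^4 l_i(y; u_i)}{\partial v_i^{*4}} \left[\sigma_2^2 - \frac{\sigma_{12}^2}{\sigma_1^2}\right]^2.$$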
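
The factor of one eighth in the integrated fourth-degree term follows from the central moments of a normal distribution. The short derivation below is an added worked step under assumed shorthand: $Z$ denotes the random interaction effect minus its conditional mean, $\tau^2$ its conditional variance, and $l^{(3)}$, $l^{(4)}$ the third and fourth derivatives of the conditional log-likelihood evaluated at the conditional mean.

```latex
% Assumed shorthand: Z ~ N(0, \tau^2), with \phi its density.
\begin{align*}
E[Z^3] &= 0, \qquad E[Z^4] = 3\tau^4, \\
\int \frac{1}{6}\, l^{(3)} Z^3 \,\phi(Z; 0, \tau^2)\, dZ &= 0, \\
\int \frac{1}{24}\, l^{(4)} Z^4 \,\phi(Z; 0, \tau^2)\, dZ
  &= \frac{3}{24}\, l^{(4)} \tau^4 = \frac{1}{8}\, l^{(4)} \tau^4 .
\end{align*}
```

The integrated term is therefore proportional to the square of the conditional variance, which is zero under the null hypothesis.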

To see if the inclusion of the higher order terms improved the type I error rates
for the previous simulations, we added $T_2^*$ into the braced term of (5.29) and reran
the simulations. The results in Table 5.7 are very similar to those reported in Table 5.6.
That is not to suggest that the inclusion of the fourth term in the Laplace expansion was
unnecessary. In general, the test statistic values obtained using the two-term Laplace
approximation underestimated the true test statistic value. For example, the average
difference between the score test values using the four-term Laplace approximation
and the two-term Laplace approximation for a center size of 75 was 0.018 with the
maximum difference being about 0.3. Note that the average test statistic value for
the center size of 75 is near the expected value of the test statistic (2.0). The Monte
Carlo error associated with this average is approximately .23. To better estimate the
test statistic value, we ran a final simulation using a center size of 100 with 500 runs.
The simulation took approximately 130 hours to run. The resulting rejection rates at
$\alpha$ = 0.10, 0.05, and 0.01 were 0.11, 0.06, and 0.01, respectively, with an average test
statistic value of 2.07 (Monte Carlo error of .11). Thus, the approximate test does
seem to perform better as the number of centers is increased.

Table 5.7: Rejection rates and average score test value for the adaptive Gauss-Hermite
quadrature approximated score test for testing of a common association parameter in
the heterogeneous association model (5.31) with $n_{ij} = 30$ and the additional fourth
term of the Laplace expansion.

NUMBER OF     TYPE I $\alpha$ RATE        AVERAGE SCORE
CENTERS       0.10    0.05    0.01        TEST VALUE
    8         0.23    0.22    0.12        0.327
   30         0.25    0.18    0.11        3.453
   50         0.18    0.14    0.04        2.436
   75         0.11    0.08    0.02        1.972

As noted before, we also programmed, using Mathematica, the score test using
analytical derivatives and the four-term Laplace approximation. The resulting
program took extremely long to run, due to the evaluation of the complicated derivatives.
Fortunately, we found that the numerical derivatives provided results within
three decimal accuracy of the analytical derivatives. Thus, one can utilize the much
simpler numerical derivatives for calculating the approximated score test, without
sacrificing substantial accuracy.

It is evident from these simulation studies that the adaptive Gauss-Hermite
approximated score test is not appropriate for the testing of a common association
parameter in the heterogeneous random effects model. Given the results of the
simulation study in the previous section, this is not surprising. With small to moderate
numbers of centers, it is difficult to accurately estimate the covariance matrix of the
random effects. However, we have seen that the association parameter, and its standard
error, can be estimated with minor bias. Even though one cannot accurately
estimate the variability of the association parameter or test for its significance,
inclusion of the additional random effect provides standard error estimates that reflect the
suspected heterogeneity. Thus we still recommend that one fit the more complicated
heterogeneity model. Though it performed poorly for this application, the adaptive
score test seems promising for other applications in which the number of clusters or
subjects is high. Rigorous proof of the distribution of the approximated score test
is still required, as is more simulation work for studying its behavior under other
sampling schemes.




5.6  Application 

We conclude this chapter by applying the methods proposed and discussed in this
chapter to Table 5.1. Recall that this table shows preliminary results from a double-blind,
parallel-group clinical trial conducted at a number of centers. The purpose of
the study was to compare an active drug, for treating asthma, to a placebo. At the
end of the study, researchers evaluated the patient's change in condition using a
three-point scale (much better, better, unchanged or worse). We will utilize the cumulative
logit link for modeling the ordinal response, though similar models could be fit using
the adjacent-category logit link as well. We concentrate on the interpretation of the
association parameter in this model, with Table 5.8 providing a summary for the
majority of the models.

We begin by reporting results for the homogeneous (5.1) and heterogeneous (5.2)
fixed effects models. For the cumulative logit link, the treatment effect estimate, $\hat{\beta}$, is
an estimate of the cumulative log odds ratio. For the model that assumes a common
association across all centers, the estimated log odds ratio is $\hat{\beta} = .93$ with a standard
error of .28. Thus, for this model, one obtains a significant treatment effect. The
estimated odds that the evaluation for the active drug falls below any fixed level
are exp(.93) = 2.5 times the estimated odds for the placebo. If one relaxes the
common association assumption and fits the heterogeneous fixed effects model (5.2),
the estimated center association parameters $\{\hat{\beta}_i\}$ range from $\hat{\beta}_2 = -1.62$ to $\hat{\beta}_1 = 3.03$
(see Table 5.9). The likelihood-ratio statistic for comparing the heterogeneous and
homogeneous fixed effects models is 24.8 on seven degrees of freedom (P < 0.001),
giving strong evidence that the association parameter varies across centers.
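
The odds ratio and the likelihood-ratio P-value quoted above can be reproduced from the reported estimates. The Python sketch below is an added illustration that uses only the standard library; `chi2_sf` is a hypothetical helper based on the closed-form chi-square tail probability for integer degrees of freedom, and the Wald ratio is included only as a rough check on the stated significance.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer powers of x.
    term, total = math.sqrt(2.0 * x / math.pi), 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return math.erfc(math.sqrt(x / 2.0)) + math.exp(-x / 2.0) * total


beta_hat, std_error = 0.93, 0.28       # homogeneous fixed effects model
odds_ratio = math.exp(beta_hat)        # cumulative odds ratio, about 2.5
wald_z = beta_hat / std_error          # estimate divided by its standard error
lr_p_value = chi2_sf(24.8, 7)          # likelihood-ratio test on 7 df
```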

For the random effects approaches we first consider the simple homogeneous model
(5.4), in which only the centers are assumed random and a common association
parameter is estimated for all centers. We begin by assuming that the distribution
of the random center effect is normal. For this model, one obtains similar results to
the fixed effects homogeneous model, with a log odds ratio estimate of $\hat{\beta} = .95$ and a
standard error of .28. The estimated standard deviation of the center random effect
is .60. For comparison, we also fit the homogeneous random effects model that allows
for varying thresholds. In this model each of the two thresholds is assumed to be
random. The estimated log odds ratio from this model is 0.93 with a standard error
of .28. Thus results are substantially the same for both approaches. One could also
obtain estimates for the homogeneous random effects model under the assumption of
a discrete distribution for the random effect. Using the NPML algorithm of Section
4.2, the estimated log odds ratio is .94 with a standard error of .28. As has been
seen before, the parametric and nonparametric approaches provide results that are
in close agreement.

The heterogeneous random effects model (5.7) allows for a random interaction
between the center and treatment effect, in addition to the random center effect. Thus
we obtain both an expected value for the log odds ratios of the centers, as well as an
estimate of its variability. For the normal random effects version of this model, the
average log odds ratio estimate is 0.92 with a standard error of .53. Thus we obtain a
similar estimate of the treatment effect, but a considerably larger standard error
estimate. This is due to the additional variance component for the center-by-treatment
interaction and reflects the uncertainty in the assumption of a common association
parameter. The estimated standard deviation of the log odds ratios among the centers
is 1.22. Using the proposed score tests in the previous section, we can test that
the variance component for the random interaction term is zero. Using the Laplace
approximate score test, which assumes that the random effects are independent and
has a standard normal distribution under the null hypothesis, the score test value was
3.60 (P < 0.01). Using the adaptive quadrature approximated score test, assuming
that the random effects are independent, the score test value was 2.38 (P = 0.12).
For the test that allows for correlated random effects, the value was 2.62 (P = 0.27).
Recall, however, that the adaptive quadrature approximated score test performed
poorly for small numbers of centers in the simulations of the previous section. Thus,
these latter two values are suspect. In Table 5.9 we report the predicted cumulative
log odds ratio estimates for the heterogeneous random effects model. These estimates
are functions of the expected value of the random interaction effect given the data for
each center and the final parameter estimates. For comparison, we have included the
log odds ratio estimates from the heterogeneous fixed effects model as well. One can
see that the estimates obtained from the random effects model are much smoother, as
the estimate for each center contains information "borrowed" from all other centers.
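
The P-values quoted for the three score tests can be checked against their reference distributions. The Python sketch below is an added illustration using closed-form tail probabilities from the standard library. It assumes an upper-tail standard normal reference for the Laplace approximate test, and chi-square references with one and two degrees of freedom for the two quadrature-based tests; the function names are hypothetical.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_upper_tail_df2(x):
    """P(X > x) for a chi-square variable with two degrees of freedom."""
    return math.exp(-x / 2.0)


p_laplace = normal_upper_tail(3.60)        # Laplace approximate score test
p_independent = chi2_upper_tail_df1(2.38)  # quadrature test, independent effects
p_correlated = chi2_upper_tail_df2(2.62)   # quadrature test, correlated effects
```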

If one relaxes the normality assumption for the random effects distribution and
fits the heterogeneous random effects model nonparametrically, one obtains an average
cumulative log odds ratio estimate of .98 with a standard error of .53. The
NPML estimate of the mixing distribution is a four-point bivariate distribution for
the random effects $(u^*, v^*)$. The estimated mass points were (.20, .16), (-1.32, 2.91),
(-.87, -.04), and (-2.28, 1.71), with corresponding probabilities (.25, .25, .37, .13).
Again we see similar results for the parametric and nonparametric approaches, even
for the bivariate random effects case. Table 5.9 contains the predicted cumulative
log odds ratio estimates for the heterogeneous random effects model using the NPML
approach. It is unclear how one should calculate standard errors for these predictions.
Naive estimates could be obtained using the maximum likelihood estimates for the
model parameters along with the usual formula for the variance of the interaction
component given the data. However, this approach does not account for plugging in
the parameter estimates. One might use a similar approach to that of Booth and
Hobert (1998) to account for the plug-in estimates. Note that the NPML estimates
seem to form four clusters. The first cluster consists of centers with very large predicted
log odds ratios (centers 1 and 5). The second cluster consists of moderate
sized log odds ratios and contains only center 7. The third cluster (centers 6 and 8)
consists of small predicted log odds ratios. The final cluster consists of those centers
that had predicted log odds ratios near zero (centers 2, 3, and 4). We suspect that
the clustering is due to the support size of the discrete distribution being four.

Table 5.8: Estimated treatment log odds ratio and standard error for various cumulative
logit models with Table 5.1.

EFFECT          CENTER     RANDOM EFFECT      $\hat{\beta}$     STANDARD
                           DISTRIBUTION                         ERROR
Homogeneous     Fixed      -                  .932              .278
                Random     Normal             .947              .276
                           Nonparametric      .938              .282
                Varying    Normal             .931              .282
Heterogeneous   Random     Normal             .923              .526
                           Nonparametric      .978              .530
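
As a rough consistency check on the NPML fit, one can compute the mean and standard deviation of the estimated mixing distribution from the reported mass points and probabilities. The Python sketch below is an added illustration, and it rests on an assumption: that the second coordinate $v^*$ of each mass point is on the cumulative log odds ratio scale, as the NPML column of Table 5.9 suggests. The function name is hypothetical, and the inputs are the rounded values quoted in the text.

```python
import math

# Reported NPML mass points (u*, v*) and their probabilities.
MASS_POINTS = [(0.20, 0.16), (-1.32, 2.91), (-0.87, -0.04), (-2.28, 1.71)]
PROBABILITIES = [0.25, 0.25, 0.37, 0.13]


def mixing_moments(points, probs):
    """Mean and standard deviation of each coordinate of a discrete distribution."""
    n_coords = len(points[0])
    means = [sum(p * pt[k] for pt, p in zip(points, probs)) for k in range(n_coords)]
    sds = [math.sqrt(sum(p * (pt[k] - means[k]) ** 2 for pt, p in zip(points, probs)))
           for k in range(n_coords)]
    return means, sds


means, sds = mixing_moments(MASS_POINTS, PROBABILITIES)
mean_log_odds_ratio = means[1]   # compare with the reported average of .98
sd_log_odds_ratio = sds[1]       # compare with 1.22 from the normal model
```

Under that assumption, both summaries land close to the values reported for the normal random effects model, which is in line with the agreement between the two approaches noted in the text.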

In this chapter we applied the ordinal multinomial random effects models of Chapters
3 and 4 to data from multi-center clinical trials. We have seen that such models
can be used to incorporate heterogeneity in both the centers and in the association
parameter across centers. However, the estimates of the heterogeneity can be very
poor when the cluster size is small to moderate. In these situations we have also seen
that tests concerning these parameters perform poorly as well. The heterogeneous
random effects model is useful for inflating the standard error of the association
parameter, but one should be cautious in interpreting the estimates obtained for the
covariance matrix.




Table 5.9: Summary of center-specific cumulative log odds ratio estimates and standard
errors (SE) for treatment effects with fixed and random effects heterogeneity
models applied to Table 5.1. ML denotes maximum likelihood estimates from the
parametric approach, while NPML denotes maximum likelihood estimates from the
nonparametric approach.

              FIXED EFFECTS         RANDOM EFFECTS MODEL (5.7)
              MODEL (5.2)           ML                    NPML
EFFECT        ESTIMATE    SE        ESTIMATE    SE        ESTIMATE
Center 1        3.03      0.87        2.35      0.75        2.91
Center 2       -1.62      0.95       -0.62      0.92       -0.01
Center 3        0.20      0.55        0.32      0.52       -0.04
Center 4        0.71      0.85        0.76      0.72        0.03
Center 5        2.84      0.95        2.11      0.83        2.88
Center 6       -1.06      1.21       -0.10      0.94        0.16
Center 7        1.76      0.87        1.53      0.73        1.71
Center 8        0.83      0.82        0.84      0.73        0.18

CHAPTER  6 
CONCLUSIONS 

6.1  Summary  of  Results 

In this dissertation, we have developed methods for modeling longitudinal or clustered
data with nominal or ordinal responses. We have concentrated on four link
functions based on the logit link: the baseline-category logit link for nominal responses
and the adjacent-category, continuation-ratio, and cumulative logit links for
ordinal responses. To account for heterogeneity among the clustered or repeated
observations, we introduced random effects linearly in the linear predictor with the fixed
effects. We motivated the models from the framework of a multivariate generalized
linear mixed model, yielding a general approach for modeling clustered multinomial
response data. We considered both parametric and nonparametric assumptions for
the distribution of the random effects. For both approaches we proposed algorithms
for obtaining maximum likelihood estimates of the fixed effects parameters and the
random effects distribution. We utilized a number of simulation studies to compare
the two approaches, as well as to investigate inferential methods for the nonparametric
approach. We then examined the use of the proposed methods for data arising
from multi-center clinical trials, and concluded by proposing a score test for testing
that a common association holds for all centers.

Random effects models provide a useful method for accounting for correlation
among repeated or clustered observations. Through the use of the multivariate generalized
linear mixed model, we have outlined a general approach for modeling clustered
multinomial response data. When the random effects are assumed to be normally
distributed, one must have accurate and efficient methods for approximating the
intractable integrals. We have found the adaptive Gauss-Hermite quadrature approach
for approximating integrals, coupled with the direct maximization of the marginal
log-likelihood, to be superior to the previously proposed algorithms for nominal and ordinal
clustered data. Specifically, we have seen that the Monte Carlo EM algorithm of
Tutz and Hennevogl (1996) can provide estimates that vary quite dramatically depending
on the Monte Carlo sample size that was used. Indeed, we feel that their
chosen sample sizes were much too small to provide accurate estimates. In addition,
the adaptive Gauss-Hermite approach is more efficient than the Gauss-Hermite approach
of Hedeker and Gibbons (1994) as it provides greater accuracy for the integral
approximations with fewer quadrature points. Thus one can fit models with greater
numbers of random effects, while obtaining accurate approximations for the integrals.
In contrast to recommendations by Hedeker and Gibbons (1994), we recommend that
the number of quadrature points be increased with increasing integral dimensions, or
at least set between 10 and 15 for each dimension. Under these recommendations,
adaptive Gauss-Hermite quadrature is reasonable for models with up to five or six
random effects.
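
The practical limit of five or six random effects can be seen from the size of the quadrature grid. The Python sketch below is an added illustration that assumes a full Cartesian product rule, with the same number of points in every dimension; the function name is hypothetical.

```python
def quadrature_grid_size(points_per_dimension, n_random_effects):
    """Number of integrand evaluations per cluster for a product rule."""
    return points_per_dimension ** n_random_effects


# Grid sizes for 10 and 15 points per dimension, one to six random effects.
for n_effects in range(1, 7):
    print(n_effects,
          quadrature_grid_size(10, n_effects),
          quadrature_grid_size(15, n_effects))
```

With 15 points per dimension and six random effects, each likelihood evaluation already requires more than eleven million integrand evaluations per cluster.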

The examples considered in this dissertation had only a small number of random
effects. For high-dimensional random effects, the adaptive quadrature approach will
be computationally too intensive (at least with the current computing power). For
such models, a Monte Carlo approach is more appropriate. We have shown that the
automated Monte Carlo EM approach of Booth and Hobert (1999) can also be used for
the multinomial random effects models. It is important to have good starting values
that are at least close to the true maximum likelihood estimates to help reduce the
computation time for the EM algorithm. The proposed pseudo-likelihood approach
for the multinomial random effects models is one way of obtaining such estimates.
We have seen that it can provide reasonable estimates, though it will most likely have
similar small sample behavior problems as its binomial counterpart. Nonetheless, it
provides a fast method for carrying out exploratory random effects modeling when
the response is nominal or ordinal.

Under  the  assumption  of  normality  for  the  random  effects,  we  also  examined  the 
varying  cumulative  threshold  model  of  Tutz  and  Hennevogl  (1996).  Though  such  a 
model  allows  increased  flexibility  in  modeling  a repeated  ordinal  response,  we  found  it 
to  be  very  unstable  and,  for  the  example  considered,  it  provided  similar  results  to  the 
shifted  threshold  model.  It  is  difficult  to  know  how  often  such  variation  occurs  in  “real 
life”  datasets.  Indeed,  the  shifted  threshold  model  produced  biased  estimates  of  the 
regression  parameter  when  we  simulated  from  a varying  threshold  model.  However, 
as  the  correlation  between  thresholds  neared  1.0  and  the  variation  in  the  thresholds 
became  more  similar,  the  shifted  threshold  model  performed  adequately.  Since,  in 
practice,  one  would  not  have  the  software  to  fit  the  extended  model,  we  recommend 
the  use  of  the  shifted  threshold  model  as  it  certainly  will  provide  better  estimates 
than  ignoring  the  variability  completely. 

An  interesting  alternative  to  the  usual  assumption  of  normality  for  the  random 
effects  is  to  estimate  their  distribution  nonparametrically.  With  such  an  approach, 
one  could  avoid  the  possible  misspecification  of  the  random  effects  distribution.  We 
investigated this approach and proposed an EM algorithm for obtaining the nonparametric
maximum likelihood (NPML) estimates of the fixed parameters and the mixing
distribution in the multinomial random effects model. We found that this approach
provided very similar results to the parametric approach in the examples that we
considered. Simulating from a variety of random effects distributions, we found that the
nonparametric approach behaved similarly to the parametric approach. In fact, the
parametric approach had, in general, very little bias in the regression parameter
estimate even when the true random effects distribution was far from normal. Thus, as
seen by others, estimation under the assumption of normality for the random effects
seems to be fairly robust to misspecification. We also examined the use of standard
maximum likelihood inferential procedures for the NPML approach and found that
the Wald and likelihood-ratio tests performed similarly to the corresponding tests in
the parametric approach. Though asymptotic theory does not yet exist for these tests
when using the NPML approach, they seem to provide at least approximately correct
inferences. An important issue in any mixture model is the identifiability of the
model parameters. We proved a series of identifiability theorems for general mixtures
of multinomial distributions, and used them to address identifiability in the multinomial
NPML model. The NPML model provides a useful alternative to the parametric
approach for fitting random effects models. From our work it seems that both approaches
will provide similar results in most situations. As the NPML approach can
converge to estimated mass points at plus and minus infinity, we recommend that
one not place too much faith in the estimated mixing distribution.

As software for random effects modeling becomes more readily available, more and
more researchers will find applications for its use. We considered one such application,
namely, analyzing ordinal response data from multi-center clinical trials. If one is
willing to treat the centers as random, one can model heterogeneity in both the centers
and the association parameter. As such data typically have small numbers of centers,
it is questionable whether the heterogeneous random effects model can provide accurate
results. In our simulations we found that there was a large amount of variability in
the variance component estimates, especially when the number of clusters was small.
However, the estimates of the association parameter were fairly accurate, as were
their corresponding standard errors. We also proposed score tests for testing that a
common association holds for all centers. We first extended the Laplace approximated
score test of Lin (1997), and then proposed an adaptive Gauss-Hermite quadrature
approximated score test. Simulation studies showed that the adaptive quadrature
score test performed very poorly for small to moderate center sizes. Given this, and
the poor estimates for the variance components, one should be cautious in making
any strong inferential statements concerning the variance components.

6.2  Future  Research 

There are a number of areas in this dissertation in which future research is possible.
Though adaptive Gauss-Hermite quadrature seems to provide accurate approximation
of integrals, and thus accurate estimates of the fixed and random effects parameters,
there is currently no method for evaluating the error in the approximations. Error
formulas exist for Gauss-Hermite quadrature, but their evaluation requires the calculation
of derivatives of extremely high order. As an alternative, one could use a similar
approach based on direct maximization of the log-likelihood, but with a Monte Carlo
technique for integration, so that one could evaluate the Monte Carlo error. That is, use
an approach similar to the automated EM algorithm of Booth and Hobert (1999),
but with a direct maximization algorithm. The direct maximization approach with
Monte Carlo integration was studied by McCulloch (1997) and was shown to perform
poorly. However, we feel that he did not choose the correct candidate distribution for
simulation. Since adaptive quadrature seems to work so well, one could use a multivariate
normal distribution with the same mean and curvature as the integrand being
approximated. This parallels exactly what is done with adaptive quadrature. One
would then develop the appropriate asymptotic results to assess the Monte Carlo
error in the approximations. In addition, it would be of interest to compare the
adaptive quadrature and Monte Carlo direct maximization approaches and see, for
example, how many samples are needed in the latter to obtain the estimates of the
former to a given degree of accuracy.
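
The idea of a candidate distribution matched to the integrand can be sketched for a single
integral. The Python fragment below is a minimal illustration, assuming a random-intercept
binomial logit model with hypothetical data and parameter values: it draws from a normal
candidate located at the mode of the integrand with variance given by the inverse
curvature, forms importance weights, and reports a Monte Carlo standard error alongside
the estimate. It covers only the integration step, not the maximization, and the function
names are invented for the example. One caveat, stated as an assumption rather than a
result of this work: when the integrand has heavier tails than the normal candidate, the
importance weights may not have finite variance, and a heavier-tailed candidate such as a
Student t is a common safeguard.

```python
import numpy as np


def log_integrand(u, y, n, beta, sigma):
    """Log of [binomial-logit kernel at u] times the N(0, sigma^2) density at u."""
    eta = beta + u
    loglik = y * eta - n * np.logaddexp(0.0, eta)  # binomial coefficient omitted
    logprior = -0.5 * (u / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return loglik + logprior


def mode_and_scale(y, n, beta, sigma):
    """Newton search for the mode of the log integrand and its curvature-based sd."""
    u_hat = 0.0
    for _ in range(100):
        p = 1.0 / (1.0 + np.exp(-(beta + u_hat)))
        step = (y - n * p - u_hat / sigma**2) / (-n * p * (1.0 - p) - 1.0 / sigma**2)
        u_hat -= step
        if abs(step) < 1e-12:
            break
    p = 1.0 / (1.0 + np.exp(-(beta + u_hat)))
    return u_hat, 1.0 / np.sqrt(n * p * (1.0 - p) + 1.0 / sigma**2)


def importance_estimate(y, n, beta, sigma, m, rng):
    """Importance sampling estimate of the integral and its Monte Carlo standard error."""
    u_hat, tau = mode_and_scale(y, n, beta, sigma)
    u = rng.normal(u_hat, tau, size=m)  # draws from the matched normal candidate
    log_q = -0.5 * ((u - u_hat) / tau) ** 2 - np.log(tau * np.sqrt(2.0 * np.pi))
    weights = np.exp(log_integrand(u, y, n, beta, sigma) - log_q)
    return weights.mean(), weights.std(ddof=1) / np.sqrt(m)


y, n, beta, sigma = 45, 50, 0.0, 1.5  # hypothetical cluster and parameter values
rng = np.random.default_rng(2024)
estimate, std_error = importance_estimate(y, n, beta, sigma, 20000, rng)

# Brute-force reference value from a very fine Riemann sum, for comparison only.
grid = np.linspace(-15.0, 15.0, 300001)
reference = np.sum(np.exp(log_integrand(grid, y, n, beta, sigma))) * (grid[1] - grid[0])

print("Monte Carlo estimate:", estimate, "standard error:", std_error)
print("relative difference from reference:", abs(estimate / reference - 1.0))
```

Increasing the number of draws until the reported standard error falls below a target gives
one simple way to ask how many samples are needed to match a given quadrature accuracy.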

There are a number of open questions with regard to the NPML approach that
still require answers. For one, the asymptotic theory for fixed effect tests is still
unknown. Simulations suggest that the usual maximum likelihood approaches perform
adequately; however, the theoretical justification is still needed. In addition, it is
important to develop methods for testing the number of mass points in the distribution.
In our algorithm, the number of mass points was increased until the deviance no
longer changed. There may be situations, however, in which, statistically, fewer mass
points are required. Such a test is under nonstandard conditions, and thus a score
test might be possible. More research is also needed to completely specify the
identifiability conditions for models with multiple random effects. The work of Butler and
Louis (1997) has addressed some of these issues for binary random effects models. In
addition, for the particular models considered here, one could modify the algorithm
to allow for mass points at plus or minus infinity. It would then be interesting to
compare the results of the two models.
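
The stopping rule of adding mass points until the deviance no longer changes can be
sketched in a much simpler setting than the multinomial models of this dissertation. The
Python fragment below is an assumed, simplified illustration: a mixture of binomials with
no fixed effects, simulated data, quantile-based starting values, and a deviance tolerance
of 0.001, none of which come from the original algorithm. It shows the EM updates for the
mass point locations and masses and the outer loop over the number of mass points.

```python
import numpy as np


def e_step(y, n, p, pi):
    """Return the mixture log-likelihood and the posterior mass point memberships."""
    # log f(y_i | p_k) up to the binomial coefficient, shape (clusters, K)
    logf = y[:, None] * np.log(p) + (n - y)[:, None] * np.log1p(-p)
    a = logf + np.log(pi)
    amax = a.max(axis=1, keepdims=True)
    logmix = amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))
    return logmix.sum(), np.exp(a - logmix[:, None])


def npml_em(y, n, K, iters=2000, tol=1e-9):
    """EM for a K-point mixing distribution on the binomial success probability."""
    # Starting values: spread the mass points over the observed proportions.
    p = np.clip(np.quantile(y / n, (np.arange(K) + 0.5) / K), 0.01, 0.99)
    pi = np.full(K, 1.0 / K)
    loglik, w = e_step(y, n, p, pi)
    trace = [loglik]
    for _ in range(iters):
        pi = w.mean(axis=0)  # M-step: masses
        p = (w * y[:, None]).sum(axis=0) / (w * n[:, None]).sum(axis=0)  # M-step: locations
        p = np.clip(p, 1e-8, 1.0 - 1e-8)
        loglik, w = e_step(y, n, p, pi)
        trace.append(loglik)
        if trace[-1] - trace[-2] < tol:
            break
    return p, pi, trace


# Simulated clusters from a two-point mixing distribution (hypothetical data).
rng = np.random.default_rng(7)
n = np.full(200, 20)
y = rng.binomial(n, rng.choice([0.2, 0.7], size=200))

fits = {}
prev_deviance = np.inf
for K in range(1, 6):
    p, pi, trace = npml_em(y, n, K)
    fits[K] = (p, pi, trace)
    deviance = -2.0 * trace[-1]  # up to a constant not depending on K
    if prev_deviance - deviance < 1e-3:  # deviance no longer changes
        break
    prev_deviance, chosen_K = deviance, K

print("number of mass points retained:", chosen_K)
print("locations:", fits[chosen_K][0], "masses:", fits[chosen_K][1])
```

A formal test for the number of mass points would replace the fixed tolerance in the outer
loop; the sketch only mimics the informal rule described above.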

Finally, more research is needed in studying the adaptive Gauss-Hermite approximated
score test. We have seen that it performs poorly for small numbers of centers
or  clusters.  Our  simulations  suggested,  however,  that  the  approximated  score  test 
would  perform  adequately  for  larger  cluster  sizes.  Thus,  one  could  examine  other 
data  situations,  such  as  longitudinal  studies,  where  the  number  of  clusters  is  large. 
A thorough  examination  of  the  null  hypothesis  distribution  is  also  needed,  especially 
when  the  hypothesis  implies  that  other  covariance  terms  are  zero. 


REFERENCES 


Adams, J., Wilson, M., and Wang, W. (1997), “The Multidimensional Random Coefficients
Multinomial Logit Model,” Applied Psychological Measurement, 21, 1-23.

Adams, R. J. and Wilson, M. R. (1996), “Formulating the Rasch Model as a Mixed
Coefficients Multinomial Logit,” In Engelhard, G. and Wilson, M. R., editors, Objective
Measurement: Theory and Practice, volume 3, pages 143-166, Norwood, NJ:
Ablex.

Agresti,  A.  (1990),  Categorical  Data  Analysis,  New  York:  Wiley. 

Agresti, A. (1993a), “Computing Conditional Maximum Likelihood Estimates for Generalized
Rasch Models Using Simple Loglinear Models with Diagonal Parameters,”
Scandinavian Journal of Statistics, 20, 63-71.

Agresti,  A.  (1993b),  “Distribution-free  Fitting  of  Logit  Models  With  Random  Effects 
of  Repeated  Categorical  Responses,”  Statistics  in  Medicine,  12,  1969-1987. 

Agresti,  A.  and  Hartzel,  J.  (1999),  “Tutorial  in  Biostatistics:  Strategies  for  Comparing 
Treatments  on  a Binary  Response  with  Multi-Center  Data,”  Statistics  in  Medicine, 
in  press. 

Agresti, A. and Lang, J. B. (1993), “A Proportional Odds Model with Subject-specific
Effects for Repeated Ordered Categorical Responses,” Biometrika, 80, 527-534.

Aitkin,  M.  (1996),  “A  General  Maximum  Likelihood  Analysis  of  Overdispersion  in 
Generalized  Linear  Models,”  Statistics  and  Computing,  6,  251-262. 

Aitkin,  M.  (1999),  “A  General  Maximum  Likelihood  Analysis  of  Variance  Components 
in  Generalized  Linear  Models,”  Biometrics,  55,  117-128. 

Aitkin, M. and Aitkin, I. (1995), “A Hybrid EM/Gauss-Newton Algorithm for Maximizing
Likelihood in Mixture Distributions,” Statistics and Computing, 6, 127-130.

Aitkin, M. and Alfo, M. (1998), “Regression Models for Binary Longitudinal Responses,”
Statistics and Computing, 8, 289-307.

Andersen,  E.  B.  (1973),  “Conditional  Inference  for  Multiple-Choice  Questionnaires,” 
British  Journal  of  Mathematical  and  Statistical  Psychology,  26,  31-44. 

Andersen,  E.  B.  (1980),  Discrete  Statistical  Models  With  Social  Science  Applications, 
Amsterdam:  North-Holland/Elsevier. 


Anderson, D. A. and Aitkin, M. (1985), “Variance Component Models with Binary
Response: Interviewer Variability,” Journal of the Royal Statistical Society, Series B,
47, 203-210.

Bartholomew, D. (1987), Latent Variable Models and Factor Analysis, New York: Oxford
Press.

Ben-Akiva, M. and Lerman, S. (1985), Discrete Choice Analysis, Cambridge: The MIT
Press.

Bohning, D. (1995), “A Review of Reliable Maximum Likelihood Algorithms for Semiparametric
Mixture Models,” Journal of Statistical Planning and Inference, 47, 5-28.

Bohning,  D.,  Dietz,  E.,  Schaub,  R.,  Schlattmann,  P.,  and  Lindsay,  B.  (1994),  “The 
Distribution  of  the  Likelihood  Ratio  for  Mixtures  of  Densities  from  the  One-Parameter 
Exponential  Family,”  Annals  of  the  Institute  of  Statistical  Mathematics,  46,  373-388. 

Booth,  J.  G.  and  Hobert,  J.  P.  (1998),  “Standard  Errors  of  Prediction  in  Generalized 
Linear  Mixed  Models,”  Journal  of  the  American  Statistical  Association,  93,  262-272. 

Booth,  J.  G.  and  Hobert,  J.  P.  (1999),  “Maximizing  Generalized  Linear  Mixed  Model 
Likelihoods  with  an  automated  Monte  Carlo  EM  Algorithm,”  Journal  of  the  Royal 
Statistical  Society,  Series  B,  61,  265-285. 

Breslow,  N.  E.  and  Clayton,  D.  G.  (1993),  “Approximate  Inference  in  Generalized 
Linear  Mixed  Models,”  Journal  of  the  American  Statistical  Association,  88,  9-25. 

Breslow,  N.  E.  and  Lin,  X.  (1995),  “Bias  Correction  in  Generalised  Linear  Mixed  Models 
With  a Single  Component  of  Dispersion,”  Biometrika,  82,  81-91. 

Broyden, C. (1970), “The Convergence of a Class of Double-rank Minimization Algorithms,”
Journal of the Institute of Mathematics and its Applications, 6, 76-90.

Bryk, A. and Raudenbush, S. (1992), Hierarchical Linear Models, Thousand Oaks,
California: Sage Publications, Inc.

Butler,  S.  M.  and  Louis,  T.  A.  (1992),  “Random  Effects  Models  with  Nonparametric 
Priors,”  Statistics  in  Medicine,  11,  1981-2000. 

Butler,  S.  M.  and  Louis,  T.  A.  (1997),  “Consistency  of  Maximum  Likelihood  Estimators 
in  General  Random  Effects  Models  for  Binary  Data,”  The  Annals  of  Statistics,  25, 
351-377. 

Chan, J. S. K. and Kuk, A. Y. C. (1997), “Maximum Likelihood Estimation for Probit-Linear
Mixed Models with Correlated Random Effects,” Biometrics, 53, 86-97.

Chant, D. (1974), “On Asymptotic Tests of Composite Hypotheses in Nonstandard
Conditions,” Biometrika, 61, 291-298.

Clogg,  C.  (1979),  “Some  Latent  Structure  Models  for  the  Analysis  of  Likert-type  Data,” 
Social  Science  Research,  8,  287-301. 


Collett, D. (1991), Modelling Binary Data, London: Chapman and Hall.

Conaway,  M.  (1989),  “Analysis  of  Repeated  Categorical  Measurements  with  Conditional 
Likelihood  Methods,”  Journal  of  the  American  Statistical  Association,  84,  53-62. 

Conaway,  M.  R.  (1990),  “A  Random  Effects  Model  for  Binary  Data,”  Biometrics,  46, 
317-328. 

Coull, B. and Agresti, A. (2000), “Random Effects Modeling of Multiple Binary Responses
Using the Multivariate Binomial Logit-Normal Distribution,” Biometrics,
56, 162-168.

Cox,  D.  R.  (1972),  “Regression  Models  and  Life-Tables  (with  Discussion),”  Journal  of 
the  Royal  Statistical  Society  Series  B,  34,  187-220. 

Davies,  R.  (1993),  “Nonparametric  Control  for  Residual  Heterogeneity  in  Modelling 
Recurrent  Behavior,”  Computational  Statistics  & Data  Analysis,  16,  143-160. 

Davies,  R.  and  Pickles,  A.  (1987),  “A  Joint  Trip  Timing,  Store  Choice  Model  Including 
Feedback  Effects  and  Nonparametric  Control  for  Omitted  Variables,”  Transportation 
Research  A,  21,  345-361. 

Davies,  R.  B.  (1987),  “Mass  Point  Methods  for  Dealing  with  Nuisance  Parameters  in 
Longitudinal  Studies,”  In  Crouchley,  R.,  editor,  Longitudinal  Data  Analysis,  pages 
88-109,  Hants,  England:  Avebury,  Aldershot. 

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation
From Incomplete Data Via the EM Algorithm (with Discussion),” Journal of
the Royal Statistical Society Series B, 39, 1-38.

Dietz, E. and Bohning, D. (1995), “Statistical Inference Based on a General Model of
Unobserved Heterogeneity,” In Steckel-Berger, G., Francis, B., Hatzinger, R., and
Seeber, G., editors, Lecture Notes in Statistics: Statistical Modelling, volume 104,
pages 75-82, New York: Springer-Verlag.

Diggle,  P.  J.,  Liang,  K.-Y.,  and  Zeger,  S.  L.  (1994),  Analysis  of  Longitudinal  Data, 
Oxford:  Clarendon  Press. 

Doornik, J. A. (1998), Object-Oriented Matrix Programming Using Ox 2.0, Kent,
England: Timberlake Consultants, Ltd.

Efron, B. and Hinkley, D. V. (1978), “Assessing the Accuracy of the Maximum Likelihood
Estimator: Observed Versus Expected Fisher Information (C/R: p482-487),”
Biometrika, 65, 457-481.

Engel,  B.  (1998),  “A  Simple  Illustration  of  the  Failure  of  PQL,  IRREML  and  APHL  as 
Approximate  ML  Methods  for  Mixed  Models  for  Binary  Data,”  Biometrical  Journal, 
40,  141-154. 


Engel,  B.  and  Keen,  A.  (1994),  “A  Simple  Approach  for  the  Analysis  of  Generalized 
Linear Mixed Models,” Statistica Neerlandica, 48, 1-22.

Ezzet,  F.  and  Whitehead,  J.  (1991),  “A  Random  Effects  Model  For  Ordinal  Responses 
From  A Crossover  Trial,”  Statistics  in  Medicine,  10,  901-907. 

Fahrmeir, L. and Tutz, G. (1994), Multivariate Statistical Modelling Based on Generalized
Linear Models, New York: Springer-Verlag New York, Inc.

Fienberg,  S.  E.  (1986),  “The  Rasch  Model,”  In  Encyclopedia  of  Statistical  Sciences, 
volume  7,  pages  627-632,  New  York:  Wiley. 

Fleiss,  J.  (1986),  “Analysis  of  Data  From  Multiclinic  Trials,”  Controlled  Clinical  Trials, 
10,  237-243. 

Fletcher,  R.  (1970),  “A  New  Approach  to  Variable  Metric  Algorithms,”  Computer 
Journal,  13,  317-322. 

Follmann, D. A. and Lambert, D. (1989), “Generalizing Logistic Regression By Nonparametric
Mixing,” Journal of the American Statistical Association, 84, 295-300.

Follmann, D. A. and Lambert, D. (1991), “Identifiability of Finite Mixtures of Logistic
Regression Models,” Journal of Statistical Planning and Inference, 27, 375-381.

Geweke, J. (1996), Handbook of Computational Statistics, chapter 15, New York: Elsevier.

Gilmour,  A.  R.,  Anderson,  R.  D.,  and  Rae,  A.  L.  (1985),  “The  Analysis  of  Data  by  a 
Generalized  Linear  Mixed  Model,”  Biometrika,  72,  593-599. 

Goldfarb,  D.  (1970),  “A  Family  of  Variable  Metric  Methods  Derived  by  Variational 
Means,”  Mathematics  of  Computation,  24,  23-26. 

Golub,  G.  H.  (1973),  “Some  Modified  Matrix  Eigenvalue  Problems,”  SIAM  Review,  15, 
318-334. 

Graham, A. (1981), Kronecker Products and Matrix Calculus with Applications, Horwood,
New York: Halsted Press.

Grizzle,  J.  E.  (1987),  “Letter  to  the  Editor,”  Controlled  Clinical  Trials,  8,  392-393. 

Harville,  D.  A.  (1977),  “Maximum  Likelihood  Approaches  to  Variance  Component 
Estimation  and  to  Related  Problems  (with  Discussion),”  Journal  of  the  American 
Statistical  Association,  72,  320-385. 

Harville,  D.  A.  and  Mee,  R.  W.  (1984),  “A  Mixed-model  Procedure  for  Analyzing 
Ordered  Categorical  Data,”  Biometrics,  40,  393-408. 

Heagerty,  P.  J.  (1999),  “Marginally  Specified  Logistic-Normal  Models  for  Longitudinal 
Binary  Data,”  Biometrics,  55,  688-698. 


Heckman,  J.  J.  and  Singer,  B.  (1984),  “A  Method  For  Minimizing  the  Impact  of 
Distributional  Assumptions  in  Econometric  Models  of  Duration,”  Econometrica,  52, 
271-320. 

Hedeker, D. (2000), “MIXNO: A Computer Program for Mixed-Effects Nominal Logistic
Regression,” Computer Methods and Programs in Biomedicine, in press.

Hedeker, D. and Gibbons, R. D. (1994), “A Random-effects Ordinal Regression Model
for Multilevel Analysis,” Biometrics, 50, 933-944.

Henderson,  C.  R.  (1975),  “Best  Linear  Unbiased  Estimation  and  Prediction  Under  a 
Selection  Model,”  Biometrics,  31,  423-447. 

Hinde,  J.  P.  (1982),  “Compound  Regression  Models,”  In  Gilchrist,  R.,  editor,  GLIM  82: 
International  Conference  for  Generalized  Linear  Models,  pages  109-121,  New  York: 
Springer. 

Hinde,  J.  P.  and  Demetrio,  C.  G.  B.  (1998),  “Overdispersion:  Models  and  Estimation,” 
Computational  Statistics  & Data  Analysis,  27,  151-170. 

Hobert,  J.  and  Casella,  G.  (1996),  “The  Effect  of  Improper  Priors  on  Gibbs  Sampling  in 
Hierarchical  Linear  Mixed  Models,”  Journal  of  the  American  Statistical  Association, 
91,  1461-1473. 

Jacqmin-Gadda,  H.  and  Commenges,  D.  (1995),  “Tests  of  Homogeneity  for  Generalized 
Linear  Models,”  Journal  of  the  American  Statistical  Association,  90,  1237-1246. 

Jansen,  J.  (1990),  “On  the  Statistical  Analysis  of  Ordinal  Data  When  Extravariation 
is  Present,”  Applied  Statistics,  39,  74-85. 

Johnson,  M.  (1987),  Multivariate  Statistical  Simulation,  New  York:  John  Wiley  & Sons, 
Inc. 

Jones,  B.,  Teather,  D.,  Wang,  J.,  and  Lewis,  J.  (1998),  “A  Comparison  of  Various 
Estimators  of  a Treatment  Difference  for  a Multi-centre  Clinical  Trial,”  Statistics  in 
Medicine,  17,  1767-1777. 

Jones, R. H. (1993), Longitudinal Data with Serial Correlation: A State-Space Approach,
London: Chapman and Hall.

Karim,  M.  R.  and  Zeger,  S.  L.  (1992),  “Generalized  Linear  Models  With  Random 
Effects;  Salamander  Mating  Revisited,”  Biometrics,  48,  631-644. 

Kaufmann,  H.  (1988),  “On  Existence  and  Uniqueness  of  Maximum  Likelihood  Estimates 
in  Quantal  and  Ordinal  Response  Models,”  Metrika,  35,  291-313. 

Keen,  A.  and  Engel,  B.  (1997),  “Analysis  of  a Mixed  Model  for  Ordinal  Data  by 
Iterative  Re-Weighted  REML,”  Statistica  Neerlandica,  51,  129-144. 


Kiefer, J. and Wolfowitz, J. (1972), “Consistency of the Maximum Likelihood Estimator
in the Presence of Infinitely Many Nuisance Parameters,” Annals of Mathematical
Statistics, 27, 887-906.

Laird, N. (1978), “Nonparametric Maximum Likelihood Estimation of a Mixing Distribution,”
Journal of the American Statistical Association, 73, 805-811.

Laird,  N.  M.  and  Ware,  J.  H.  (1982),  “Random  Effects  Models  for  Longitudinal  Data,” 
Biometrics,  38,  963-974. 

Lesperance, M. and Kalbfleisch, J. D. (1992), “An Algorithm for Computing the Nonparametric
MLE of a Mixing Distribution,” Journal of the American Statistical Association,
87, 120-126.

Liang,  K.  Y.  (1987),  “A  Locally  Most  Powerful  Test  for  Homogeneity  with  Many 
Strata,”  Biometrika,  74,  259-264. 

Liang,  K.  Y.  and  Zeger,  S.  (1986),  “Longitudinal  Data  Analysis  Using  Generalized 
Linear  Models,”  Biometrika,  73,  13-22. 

Lin, X. (1997), “Variance Component Testing in Generalised Linear Models with Random
Effects,” Biometrika, 84, 309-326.

Lin, X. and Breslow, N. E. (1996), “Bias Correction in Generalised Linear Mixed Models
With Multiple Components of Dispersion,” Journal of the American Statistical
Association, 91, 1007-1016.

Lindsay,  B.  (1983a),  “The  Geometry  of  Mixture  Likelihoods:  A General  Theory,”  The 
Annals  of  Statistics,  11,  86-94. 

Lindsay,  B.  (1983b),  “The  Geometry  of  Mixture  Likelihoods,  Part  II:  The  Exponential 
Family,”  The  Annals  of  Statistics,  11,  783-792. 

Lindsay,  B.  (1989),  “Moment  Matrices:  Applications  in  Mixtures,”  The  Annals  of 
Statistics,  17,  722-740. 

Lindsay,  B.,  Clogg,  C.  C.,  and  Grego,  J.  (1991),  “Semiparametric  Estimation  in  the 
Rasch  Model  and  Related  Exponential  Response  Models,  Including  a Simple  Latent 
Class  Model  for  Item  Analysis,”  Journal  of  the  American  Statistical  Association,  86, 
96-107. 

Lindsey,  J.,  Jones,  B.,  and  Ebbutt,  A.  (1997),  “Simple  Models  for  Repeated  Ordinal 
Responses  with  an  Application  to  a Seasonal  Rhinitis  Clinical  Trial,”  Statistics  in 
Medicine,  16,  2873-2882. 

Lindsey, J. K. (1993), Models for Repeated Measurements, Oxford: Oxford University
Press.


Lipsitz,  S.  R.,  Kim,  K.,  and  Zhao,  L.  (1994),  “Analysis  of  Repeated  Categorical  Data 
Using  Generalized  Estimating  Equations,”  Statistics  in  Medicine,  13,  1149-1163. 

Lipsitz, S. R., Laird, N. M., and Harrington, D. P. (1991), “Generalized Estimating
Equations for Correlated Binary Data: Using the Odds Ratio as a Measure of
Association,” Biometrika, 78, 153-160.

Littell,  R.  C.,  Milliken,  G.  A.,  Stroup,  W.  W.,  and  Wolfinger,  R.  D.  (1996),  SAS  System 
for  Mixed  Models,  Cary,  NC:  SAS  Institute,  Inc. 

Liu,  C.  and  Rubin,  D.  (1994),  “The  ECME  Algorithm:  A Simple  Extension  of  EM  and 
ECM  with  Faster  Monotone  Convergence,”  Biometrika,  81,  633-648. 

Liu, C. and Sun, D. (1997), “Acceleration of EM Algorithm for Mixture Models Using
ECME,” In Proceedings of the Statistical Computing Section, pages 109-114,
Washington: The American Statistical Association.

Liu,  I.  and  Agresti,  A.  (1996),  “Mantel-Haenszel-Type  Inference  for  Cumulative  Odds 
Ratios  with  a Stratified  Response,”  Biometrics,  52,  1223-1234. 

Liu,  Q.  and  Pierce,  D.  A.  (1994),  “A  Note  on  Gauss-Hermite  Quadrature,”  Biometrika, 
81,  624-629. 

Longford,  N.  T.  (1993),  Random  Coefficient  Models,  New  York:  Oxford  University 
Press. 

Louis,  T.  A.  (1982),  “Finding  the  Observed  Information  Matrix  When  Using  the  EM 
Algorithm,”  Journal  of  the  Royal  Statistical  Society,  Series  B,  44,  226-233. 

Maddala, G. S. (1983), Limited-Dependent and Qualitative Variables in Econometrics,
Cambridge: Cambridge University Press.

Mantel, N. and Haenszel, W. (1959), “Statistical aspects of the analysis of data from
retrospective studies of disease,” Journal of the National Cancer Institute, 22,
719-748.

Masters,  G.  (1985),  “A  Comparison  of  Latent  Trait  and  Latent  Class  Analyses  of 
Likert-type  Data,”  Psychometrika,  50,  69-82. 

McCullagh, P. (1980), “Regression Models for Ordinal Data (With Discussion),” Journal
of the Royal Statistical Society, Series B, 42, 109-142.

McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, New York: Chapman
and Hall.

McCulloch,  C.  E.  (1994),  “Maximum  Likelihood  Variance  Components  Estimation  for 
Binary  Data,”  Journal  of  the  American  Statistical  Association,  89,  330-335. 

McCulloch,  C.  E.  (1997),  “Maximum  Likelihood  Algorithms  for  Generalized  Linear 
Mixed  Models,”  Journal  of  the  American  Statistical  Association,  92,  162-170. 


Meng, X. and Rubin, D. (1993), “Maximum Likelihood Estimation via the ECM Algorithm:
A General Framework,” Biometrika, 80, 267-279.

Natarajan, R. and McCulloch, C. E. (1995), “A Note on the Existence of the Posterior
Distribution for a Class of Mixed Models for Binomial Responses,” Biometrika, 82,
639-643.

Nelder,  J.  A.  and  Wedderburn,  R.  W.  M.  (1972),  “Generalized  Linear  Models,”  Journal 
of  the  Royal  Statistical  Society,  Series  A,  135,  370-384. 

Neuhaus, J. M., Hauck, W. W., and Kalbfleisch, J. D. (1992), “The Effects of
Mixture Distribution Misspecification When Fitting Mixed-effects Logistic Models,”
Biometrika, 79, 755-762.

Neuhaus, J. M., Kalbfleisch, J. D., and Hauck, W. W. (1991), “A Comparison of Cluster-specific
and Population-averaged Approaches for Analyzing Correlated Binary Data,”
International Statistical Review, 59, 25-35.

Pendergast, J. F., Gange, S. J., Newton, M. A., Lindstrom, M. J., Palta, M., and Fisher,
M. R. (1996), “A Survey of Methods for Analyzing Clustered Binary Response Data,”
International Statistical Review, 64, 89-118.

Pierce, D. A. and Sands, B. R. (1975), “Extra-Bernoulli Variation in Regression of
Binary Data,” Technical Report 46, Oregon State University, Dept. of Statistics,
Corvallis.

Pinheiro, J. C. and Bates, D. M. (1995), “Approximations to the Log-Likelihood Function
in the Nonlinear Mixed-effects Model,” Journal of Computational and Graphical
Statistics, 4, 12-35.

Prakasa Rao, B. (1987), Asymptotic Theory of Statistical Inference, New York: Wiley.

Prakasa Rao, B. (1992), Identifiability in Stochastic Models: Characterization of Probability
Distributions, New York: Academic Press, Inc.

Prentice, R. L. (1988), “Correlated Binary Regression with Covariates Specific to Each
Binary Observation,” Biometrics, 44, 1033-1084.

Price,  C.  J.,  Kimmel,  C.  A.,  Tyl,  R.  W.,  and  Marr,  M.  C.  (1985),  “The  Developmental 
Toxicity  of  Ethylene  Glycol  in  Rats  and  Mice,”  Toxicology  and  Applied  Pharmacology, 
81,  113-127. 

Randall,  J.  (1989),  “The  Analysis  of  Sensory  Data  by  Generalized  Linear  Models,” 
Biometrical  Journal,  31,  781-793. 

Rao,  C.  R.  (1973),  Linear  Statistical  Inference  and  its  Applications,  New  York:  Wiley, 
2nd  edition. 


Rasch, G. (1961), “On General Laws and the Meaning of Measurement in Psychology,”
In Neyman, J., editor, Proceedings of the 4th Berkeley Symposium on Mathematical
Statistics and Probability, Vol 4, Berkeley: University of California Press.

Redner, R. and Walker, H. (1984), “Mixture Densities, Maximum Likelihood, and the
EM Algorithm,” SIAM Review, 26, 195-239.

Searle,  S.  R.,  Casella,  G.,  and  McCulloch,  C.  E.  (1992),  Variance  Components,  New 
York:  John  Wiley  & Sons,  Inc. 

Self,  S.  and  Liang,  K.-Y.  (1987),  “Asymptotic  Properties  of  Maximum  Likelihood 
Estimators  and  Likelihood  Ratio  Tests  Under  Nonstandard  Conditions,”  Journal  of 
the  American  Statistical  Association,  82,  605-610. 

Senn,  S.  (1998),  “Some  Controversies  in  Planning  and  Analysing  Multi-centre  Trials,” 
Statistics  in  Medicine,  17,  1753-1765. 

Shanno, D. (1970), “Conditioning of Quasi-Newton Methods for Function Minimization,”
Mathematics of Computation, 24, 647-657.

Skene,  A.  M.  and  Wakefield,  J.  C.  (1990),  “Hierarchical  Models  for  Multicentre  Binary 
Response  Studies,”  Statistics  in  Medicine,  9,  919-929. 

Stiratelli,  R.,  Laird,  N.  M.,  and  Ware,  J.  H.  (1984),  “Random  Effects  Models  for  Serial 
Observations  with  Binary  Responses,”  Biometrics,  40,  961-971. 

Stroud,  A.  and  Secrest,  D.  (1966),  Gaussian  Quadrature  Formulas,  Englewood  Cliffs, 
New  Jersey:  Prentice  Hall. 

Swallow,  W.  and  Monahan,  J.  (1984),  “Monte  Carlo  Comparison  of  ANOVA,  MIVQUE, 
REML,  and  ML  Estimators  of  Variance  Components,”  Technometrics,  28,  47-57. 

Tanner, M. A. (1993), Tools for Statistical Inference: Observed Data and Data Augmentation
(2nd ed.), Berlin: Springer-Verlag.

Tanner, M. A. (1996), Tools for Statistical Inference: Methods for the Exploration of
Posterior Distributions and Likelihood Functions, Berlin: Springer-Verlag.

Teicher, H. (1963), “Identifiability of Finite Mixtures,” Annals of Mathematical Statistics,
34, 1265-1269.

Teicher, H. (1967), “Identifiability of Mixtures of Product Measures,” Annals of
Mathematical Statistics, 38, 1300-1302.

Ten Have, T. R. (1996), “A Mixed Effects Model for Multivariate Ordinal Response
Data Including Correlated Discrete Failure Times with Ordinal Responses,” Biometrics,
52, 473-491.


Ten  Have,  T.  R.,  Landis,  J.  R.,  and  Hartzel,  J.  (1996),  “Population-Averaged  and 
Cluster-Specific  Models  for  Clustered  Ordinal  Response  Data,”  Statistics  in  Medicine, 
15,  2573-2588. 

Ten Have, T. R. and Uttal, D. H. (1994), “Subject-Specific and Population-Averaged
Continuation Ratio Logit Models for Multiple Discrete Time Survival Profiles,”
Applied Statistics, 43, 371-384.

Thurstone, L. (1927), “A Law of Comparative Judgment,” Psychological Review, 34,
273-286.

Titterington,  D.,  Smith,  A.,  and  Makov,  U.  (1985),  Statistical  Analysis  of  Finite  Mixture 
Distributions,  New  York:  Wiley. 

Tjur, T. (1982), “A Connection Between Rasch’s Item Analysis Model and a
Multiplicative Poisson Model,” Scandinavian Journal of Statistics, 9, 23-30.

Tutz,  G.  and  Hennevogl,  W.  (1996),  “Random  Effects  in  Ordinal  Regression  Models,” 
Computational  Statistics  & Data  Analysis,  22,  537-557. 

Uesaka,  H.  (1993),  “Test  for  Interaction  Between  Treatment  and  Stratum  with  Ordinal 
Responses,”  Biometrics,  49,  123-129. 

Wedderburn,  R.  W.  M.  (1974),  “Quasi-likelihood  Functions,  Generalized  Linear  Models, 
and the Gauss-Newton Method,” Biometrika, 61, 439-447.

Wedel,  M.  and  DeSarbo,  W.  S.  (1995),  “A  Mixture  Likelihood  Approach  for  Generalized 
Linear  Models,”  Journal  of  Classification,  12,  21-55. 

Williams,  D.  A.  (1982),  “Extra-Binomial  Variation  in  Logistic  Linear  Models,”  Applied 
Statistics,  31,  144-148. 

Wolfinger, R. and O’Connell, M. (1993), “Generalized Linear Mixed Models: A
Pseudo-likelihood Approach,” Journal of Statistical Computation and Simulation, 48,
233-243.

Wolfram, S. (1996), The Mathematica Book (3rd ed.), New York: Wolfram
Media/Cambridge University Press.

Wood, A. and Hinde, J. (1987), “Binomial Variance Component Models with a
Non-parametric Assumption Concerning Random Effects,” in Crouchley, R., editor,
Longitudinal Data Analysis, pages 110-128, Aldershot, Hants, England: Avebury.

Zeger, S. and Liang, K. Y. (1986), “Longitudinal Data Analysis for Discrete and
Continuous Outcomes,” Biometrics, 42, 121-130.

Zeger, S. L. and Karim, M. R. (1991), “Generalized Linear Models With Random
Effects: A Gibbs Sampling Approach,” Journal of the American Statistical Association,
86, 79-86.




Zeger,  S.  L.,  Liang,  K.-Y.,  and  Albert,  P.  S.  (1988),  “Models  for  Longitudinal  Data: 
A Generalized  Estimating  Equation  Approach  (Corr:  V45  P347),”  Biometrics,  44, 
1049-1060. 


BIOGRAPHICAL  SKETCH 

Jonathan Seth Hartzel was born February 13, 1971, in Lansdale, Pennsylvania,
along with his twin sister, Kristin, to Norm and Judy Hartzel, joining their older
brother, Nathan. Soon after birth, Jonathan moved to Telford, Pennsylvania, where he
lived until August of 1989, when he left for college. Both Jonathan and Kristin attended
Messiah College in Grantham, Pennsylvania, where Jonathan played collegiate soccer
and graduated with a Bachelor of Arts degree in mathematics in June of 1993.
He then moved to Elizabethtown, Pennsylvania, where he worked for the Center for
Biostatistics and Epidemiology in the College of Medicine of Pennsylvania State
University. During his time there, he met his future wife, Tracy Ann Plieninger. In
August of 1994, Jonathan moved to Gainesville, Florida, and began graduate school
in the Department of Statistics at the University of Florida.

In his first three years in the Department of Statistics, Jonathan worked in the
Biostatistics Consulting Lab, where he provided consulting and statistical support
for doctors and medical students in Shands Medical Center. During his final years in
the department, Jonathan has worked under the direction of his advisor, Dr. Alan
Agresti, as a research assistant. In August of 1995, Jonathan married Tracy in
Oreland, Pennsylvania, following which Tracy joined Jonathan in Gainesville, taking
employment with the Department of Biostatistics at the University of Florida. Jonathan
received his Master of Statistics degree in December of 1996 and plans to receive his
Ph.D. in December of 1999. Jonathan and Tracy look forward to moving to Blue
Bell, Pennsylvania, with their two Rhodesian Ridgebacks, Riley and Kendi, where
Jonathan has accepted a position with Merck Research Laboratories, and Tracy has
accepted a position with Wyeth-Ayerst pharmaceutical company.




I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


Alan  G.  Agresti,  Chairman 
Professor  of  Statistics 

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


Malay  Ghosh 
Distinguished  Professor  of  Statistics 

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


James Hobert
Assistant Professor of Statistics


I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


Ramon  C.  Littell 
Professor  of  Statistics 

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


Gary  Miller 
Associate  Professor  of  Mechanical 
Engineering 


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  Department  of 
Statistics  in  the  College  of  Liberal  Arts  and  Sciences  and  to  the  Graduate  School  and 
was  accepted  as  partial  fulfillment  of  the  requirements  for  the  degree  of  Doctor  of 
Philosophy. 


December  1999 


Dean,  Graduate  School 


